App Serial #08/769,052 Ext^ r 

Doooho 8t aL LEX-01184JSA 
Polynucteotiiios Encoding A Human Nitribse Polypepbdo (As Previously 



The Sequence of the Human Genome



J. Craig Venter, Mark D. Adams, Eugene W. Myers, Peter W. Li, Richard J. Mural, Granger G. Sutton, Hamilton O. Smith, Mark Yandell, Cheryl A. Evans, Robert A. Holt, Jeannine D. Gocayne, Peter Amanatides, Richard M. Ballew, Daniel H. Huson, Jennifer Russo Wortman, Qing Zhang, Chinnappa D. Kodira, Xiangqun H. Zheng, Lin Chen, Marian Skupski, Gangadharan Subramanian, Paul D. Thomas, Jinghui Zhang, George L. Gabor Miklos, Catherine Nelson, Samuel Broder, Andrew G. Clark, Joe Nadeau, Victor A. McKusick, Norton Zinder, Arnold J. Levine, Richard J. Roberts, Mel Simon, Carolyn Slayman, Michael Hunkapiller, Randall Bolanos, Arthur Delcher, Daniel Fasulo, Michael Flanigan, Liliana Florea, Aaron Halpern, Sridhar Hannenhalli, Saul Kravitz, Samuel Levy, Clark Mobarry, Knut Reinert, Karin Remington, Jane Abu-Threideh, Ellen Beasley, Kendra Biddick, Vivien Bonazzi, Rhonda Brandon, Michele Cargill, Ishwar Chandramouliswaran, Rosane Charlab, Kabir Chaturvedi, Zuoming Deng, Valentina Di Francesco, Patrick Dunn, Karen Eilbeck, Carlos Evangelista, Andrei E. Gabrielian, Weiniu Gan, Wangmao Ge, Fangcheng Gong, Zhiping Gu, Ping Guan, Thomas J. Heiman, Maureen E. Higgins, Rui-Ru Ji, Zhaoxi Ke, Karen A. Ketchum, Zhongwu Lai, Yiding Lei, Zhenya Li, Jiayin Li, Yong Liang, Xiaoying Lin, Fu Lu, Gennady V. Merkulov, Natalia Milshina, Helen M. Moore, Ashwinikumar K. Naik, Vaibhav A. Narayan, Beena Neelam, Deborah Nusskern, Douglas B. Rusch, Steven Salzberg, Wei Shao, Bixiong Shue, Jingtao Sun, Zhen Yuan Wang, Aihui Wang, Xin Wang, Jian Wang, Ming-Hui Wei, Ron Wides, Chunlin Xiao, Chunhua Yan, Alison Yao, Jane Ye, Ming Zhan, Weiqing Zhang, Hongyu Zhang, Qi Zhao, Liansheng Zheng, Fei Zhong, Wenyan Zhong, Shiaoping C. Zhu, Shaying Zhao, Dennis Gilbert, Suzanna Baumhueter, Gene Spier, Christine Carter, Anibal Cravchik, Trevor Woodage, Feroze Ali, Huijin An, Aderonke Awe, Danita Baldwin, Holly Baden, Mary Barnstead, Ian Barrow, Karen Beeson, Dana Busam, Amy Carver, Angela Center, Ming Lai Cheng, Liz Curry, Steve Danaher, Lionel Davenport, Raymond Desilets, Susanne Dietz, Kristina Dodson, Lisa Doup, Steven Ferriera, Neha Garg, Andres Gluecksmann, Brit Hart, Jason Haynes, Charles Haynes, Cheryl Heiner, Suzanne Hladun, Damon Hostin, Jarrett Houck, Timothy Howland, Chinyere Ibegwam, Jeffery Johnson, Francis Kalush, Lesley Kline, Shashi Koduru, Amy Love, Felecia Mann, David May, Steven McCawley, Tina McIntosh, Ivy McMullen, Mee Moy, Linda Moy, Brian Murphy, Keith Nelson, Cynthia Pfannkoch, Eric Pratts, Vinita Puri, Hina Qureshi, Matthew Reardon, Robert Rodriguez, Yu-Hui Rogers, Deanna Romblad, Bob Ruhfel, Richard Scott, Cynthia Sitter, Michelle Smallwood, Erin Stewart, Renee Strong, Ellen Suh, Reginald Thomas, Ni Ni Tint, Sukyee Tse, Claire Vech, Gary Wang, Jeremy Wetter, Sherita Williams, Monica Williams, Sandra Windsor, Emily Winn-Deen, Keriellen Wolfe, Jayshree Zaveri, Karena Zaveri, Josep F. Abril, Roderic Guigo, Michael J. Campbell, Kimmen V. Sjolander, Brian Karlak, Anish Kejariwal, Huaiyu Mi, Betty Lazareva, Thomas Hatton, Apurva Narechania, Karen Diemer, Anushya Muruganujan, Nan Guo, Shinji Sato, Vineet Bafna, Sorin Istrail, Ross Lippert, Russell Schwartz, Brian Walenz, Shibu Yooseph, David Allen, Anand Basu, James Baxendale, Louis Blick, Marcelo Caminha, John Carnes-Stine, Parris Caulk, Yen-Hui Chiang, My Coyne, Carl Dahlke, Anne Deslattes Mays, Maria Dombroski, Michael Donnelly, Dale Ely, Shiva Esparham, Carl Fosler, Harold Gire, Stephen Glanowski, Kenneth Glasser, Anna Glodek, Mark Gorokhov, Ken Graham, Barry Gropman, Michael Harris, Jeremy Heil, Scott Henderson, Jeffrey Hoover, Donald Jennings, Catherine Jordan, James Jordan, John Kasha, Leonid Kagan, Cheryl Kraft, [...] Michael Simpson, Thomas Smith, Arlan Sprague, Timothy Stockwell, Russell Turner, Eli Venter, Mei Wang, Meiyuan Wen, David Wu, Mitchell Wu, Ashley Xia, Ali Zandieh, Xiaohong Zhu



16 FEBRUARY 2001 VOL 291 SCIENCE www.sciencemag.org 



THE HUMAN GENOME

A 2.91-billion base pair (bp) consensus sequence of the euchromatic portion of the human genome was generated by the whole-genome shotgun sequencing method. The 14.8-billion bp DNA sequence was generated over 9 months from 27,271,853 high-quality sequence reads (5.11-fold coverage of the genome) from both ends of plasmid clones made from the DNA of five individuals. Two assembly strategies (a whole-genome assembly and a regional chromosome assembly) were used, each combining sequence data from Celera and the publicly funded genome effort. The public data were shredded into 550-bp segments to create a 2.9-fold coverage of those genome regions that had been sequenced, without including biases inherent in the cloning and assembly procedure used by the publicly funded group. This brought the effective coverage in the assemblies to eightfold, reducing the number and size of gaps in the final assembly over what would be obtained with 5.11-fold coverage. The two assembly strategies yielded very similar results that largely agree with independent mapping data. The assemblies effectively cover the euchromatic regions of the human chromosomes. More than 90% of the genome is in scaffold assemblies of 100,000 bp or more, and 25% of the genome is in scaffolds of 10 million bp or larger. Analysis of the genome sequence revealed 26,588 protein-encoding transcripts for which there was strong corroborating evidence and an additional ~12,000 computationally derived genes with mouse matches or other weak supporting evidence. Although gene-dense clusters are obvious, almost half the genes are dispersed in low G+C sequence separated by large tracts of apparently noncoding sequence. Only 1.1% of the genome is spanned by exons, whereas 24% is in introns, with 75% of the genome being intergenic DNA. Duplications of segmental blocks, ranging in size up to chromosomal lengths, are abundant throughout the genome and reveal a complex evolutionary history. Comparative genomic analysis indicates vertebrate expansions of genes associated with neuronal function, with tissue-specific developmental regulation, and with the hemostasis and immune systems. DNA sequence comparisons between the consensus sequence and publicly funded genome data provided locations of 2.1 million single-nucleotide polymorphisms (SNPs). A random pair of human haploid genomes differed at a rate of 1 bp per 1250 on average, but there was marked heterogeneity in the level of polymorphism across the genome. Less than 1% of all SNPs resulted in variation in proteins, but the task of determining which SNPs have functional consequences remains an open challenge.



Decoding of the DNA that constitutes the human genome has been widely anticipated for the contribution it will make toward understanding human evolution, the causation of disease, and the interplay between the environment and heredity in defining the human condition. A project with the goal of determining the complete nucleotide sequence of the human genome was first formally proposed in 1985 (1). In subsequent years, the idea met with mixed reactions in the scientific community (2). However, in 1990, the Human Genome Project (HGP) was officially initiated in the United States under the direction of the National Institutes of Health and the U.S. Department of Energy with a 15-year, $3 billion plan for completing the genome sequence. In 1998 we announced our intention to build a unique genome-sequencing facility, to determine the sequence of the human genome over a 3-year period. Here we report the penultimate milestone along the path toward that goal, a nearly complete sequence of the euchromatic portion of the human genome. The sequencing was performed by a whole-genome random shotgun method with subsequent assembly of the sequenced segments.

The modern history of DNA sequencing began in 1977, when Sanger reported his method for determining the order of nucleotides of DNA using chain-terminating nucleotide analogs (3). In the same year, the first human gene was isolated and sequenced (4). In 1986, Hood and co-workers (5) described an improvement in the Sanger sequencing method that included attaching fluorescent dyes to the nucleotides, which permitted them to be sequentially read by a computer. The first automated DNA sequencer, developed by Applied Biosystems in California in 1987, was shown to be successful when the sequences of two genes were obtained with this new technology (6). From early sequencing of human genomic regions (7), it became clear that cDNA sequences (which are reverse-transcribed from RNA) would be essential to annotate and validate gene predictions in the human genome. These studies were the basis in part for the development of the expressed sequence tag (EST) method of gene identification (8), which is a random selection, very high throughput sequencing approach to characterize cDNA libraries. The EST method led to the rapid discovery and mapping of human genes (9). The increasing numbers of human EST sequences necessitated the development of new computer algorithms to analyze large amounts of sequence data, and in 1993 at The Institute for Genomic Research (TIGR), an algorithm was developed that permitted assembly and analysis of hundreds of thousands of ESTs. This algorithm permitted characterization and annotation of human genes on the basis of 30,000 EST assemblies (10).
The complete 49-kbp bacteriophage lambda genome sequence was determined by a shotgun restriction digest method in 1982 (11). When considering methods for sequencing the smallpox virus genome in 1991 (12), a whole-genome shotgun sequencing method was discussed and subsequently rejected owing to the lack of appropriate software tools for genome assembly. However, in 1994, when a microbial genome-sequencing project was contemplated at TIGR, a whole-genome shotgun sequencing approach was considered possible with the TIGR EST assembly algorithm. In 1995, the 1.8-Mbp Haemophilus influenzae genome was completed by a whole-genome shotgun sequencing method (13). The experience with several subsequent genome-sequencing efforts established the broad applicability of this approach (14, 15).
A key feature of the sequencing approach used for these megabase-size and larger genomes was the use of paired-end sequences (also called mate pairs), derived from subclone libraries with distinct insert sizes and cloning characteristics. Paired-end sequences are sequences 500 to 600 bp in length from both ends of double-stranded DNA clones of prescribed lengths. The success of using end sequences from long segments (18 to 20 kbp) of DNA cloned into bacteriophage lambda in assembly of the microbial genomes led to the suggestion (16) of an approach to simultaneously map and sequence the human genome by means of end sequences from 150-kbp bacterial artificial chromosomes (BACs) (17, 18). The end sequences spanned by known distances provide long-range continuity across the genome. A modification of the BAC end-sequencing (BES) method was applied successfully to complete chromosome 2 from the Arabidopsis thaliana genome (19).

In 1997, Weber and Myers (20) proposed whole-genome shotgun sequencing of the human genome. Their proposal was not well received (21). However, by early 1998, as less than 5% of the genome had been sequenced, it was clear that the rate of progress in human genome sequencing worldwide was very slow (22), and the prospects for finishing the genome by the 2005 goal were uncertain.

In early 1998, PE Biosystems (now Applied Biosystems) developed an automated, high-throughput capillary DNA sequencer, subsequently called the ABI PRISM 3700 DNA Analyzer. Discussions between PE Biosystems and TIGR scientists resulted in a plan to undertake the sequencing of the human genome with the 3700 DNA Analyzer and the whole-genome shotgun sequencing techniques developed at TIGR (23). Many of the principles of operation of a genome-sequencing facility were established in the TIGR facility (24). However, the facility envisioned for Celera would have a capacity roughly 50 times that of TIGR, and thus new developments were required for sample preparation and tracking and for whole-genome assembly. Some argued that the required 150-fold scale-up from the H. influenzae genome to the human genome with its complex repeat sequences was not feasible (25). The Drosophila melanogaster genome was thus chosen as a test case for whole-genome assembly on a large and complex eukaryotic genome. In collaboration with Gerald Rubin and the Berkeley Drosophila Genome Project, the nucleotide sequence of the 120-Mbp euchromatic portion of the Drosophila genome was determined over a 1-year period (26-28). The Drosophila genome-sequencing effort resulted in two key findings: (i) that the assembly algorithms could generate chromosome assemblies with highly accurate order and orientation with substantially less than 10-fold coverage, and (ii) that undertaking multiple interim assemblies in place of one comprehensive final assembly was not of value.

These findings, together with the dramatic changes in the public genome effort subsequent to the formation of Celera (29), led to a modified whole-genome shotgun sequencing approach to the human genome. We initially proposed to do 10-fold sequence coverage of the genome over a 3-year period and to make interim assembled sequence data available quarterly. The modifications included a plan to perform random shotgun sequencing to ~5-fold coverage and to use the unordered and unoriented BAC sequence fragments and subassemblies published in GenBank by the publicly funded genome effort (30) to accelerate the project. We also abandoned the quarterly announcements in the absence of interim assemblies to report.

Although this strategy provided a reasonable result very early that was consistent with a whole-genome shotgun assembly with eightfold coverage, the human genome sequence is not as finished as the Drosophila genome was with an effective 13-fold coverage. However, it became clear that even with this reduced coverage strategy, Celera could generate an accurately ordered and oriented scaffold sequence of the human genome in less than 1 year. Human genome sequencing was initiated 8 September 1999 and completed 17 June 2000. The first assembly was completed 25 June 2000, and the assembly reported here was completed 1 October 2000. Here we describe the whole-genome random shotgun sequencing effort applied to the human genome. We developed two different assembly approaches for assembling the ~3 billion bp that make up the 23 pairs of chromosomes of the Homo sapiens genome. Any GenBank-derived data were shredded to remove potential bias to the final sequence from chimeric clones, foreign DNA contamination, or misassembled contigs. Insofar as a correctly and accurately assembled genome sequence with faithful order and orientation of contigs is essential for an accurate analysis of the human genetic code, we have devoted a considerable portion of this manuscript to the documentation of the quality of our reconstruction of the genome. We also describe our preliminary analysis of the human genetic code on the basis of computational methods. Figure 1 (see fold-out chart associated with this issue; files for each chromosome can be found in Web fig. 1 on Science Online at www.sciencemag.org/cgi/content/full/291/5507/1304/DC1) provides a graphical overview of the genome and the features encoded in it. The detailed manual curation and interpretation of the genome are just beginning.
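
The shredding of GenBank-derived data mentioned above can be illustrated with a minimal sketch. It assumes a simple fixed-length tiling in which overlapping 550-bp pieces are cut at a regular step. The function name `shred`, the `tiling_depth` parameter, and its default value are hypothetical choices for illustration; the text does not say how the pieces were laid out.

```python
def shred(sequence: str, fragment_len: int = 550, tiling_depth: float = 2.0) -> list[str]:
    """Cut a sequence into overlapping fixed-length fragments.

    fragment_len follows the 550-bp segment size given in the text.
    tiling_depth (how many fragments cover each base on average) is an
    assumed, illustrative parameter.
    """
    if len(sequence) <= fragment_len:
        return [sequence]
    # A step of fragment_len / tiling_depth gives roughly tiling_depth-fold overlap.
    step = max(1, round(fragment_len / tiling_depth))
    last_start = len(sequence) - fragment_len
    starts = list(range(0, last_start + 1, step))
    # Add a final fragment so the end of the sequence is always covered.
    if starts[-1] != last_start:
        starts.append(last_start)
    return [sequence[s:s + fragment_len] for s in starts]


# Toy input: a 4,000-bp string standing in for one assembled public fragment.
toy = "ACGT" * 1000
pieces = shred(toy)
```

Because each piece carries no record of the clone it came from, an assembler treating the pieces as ordinary reads is free to place them wherever the overlaps support, which is the sense in which shredding removes bias from the source assemblies.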
To aid the reader in locating specific analytical sections, we have divided the paper into seven broad sections. A summary of the major results appears at the beginning of each section.

1 Sources of DNA and Sequencing Methods
2 Genome Assembly Strategy and Characterization
3 Gene Prediction and Annotation
4 Genome Structure
5 Genome Evolution
6 A Genome-Wide Examination of Sequence Variations
7 An Overview of the Predicted Protein-Coding Genes in the Human Genome
8 Conclusions



1 Sources of DNA and Sequencing Methods

Summary. This section discusses the rationale and ethical rules governing donor selection to ensure ethnic and gender diversity along with the methodologies for DNA extraction and library construction. The plasmid library construction is the first critical step in shotgun sequencing. If the DNA libraries are not uniform in size, nonchimeric, and do not randomly represent the genome, then the subsequent steps cannot accurately reconstruct the genome sequence. We used automated high-throughput DNA sequencing and the computational infrastructure to enable efficient tracking of enormous amounts of sequence information (27.3 million sequence reads; 14.9 billion bp of sequence). Sequencing and tracking from both ends of plasmid clones from 2-, 10-, and 50-kbp libraries were essential to the computational reconstruction of the genome. Our evidence indicates that the accurate pairing rate of end sequences was greater than 98%.

Various policies of the United States and the World Medical Association, specifically the Declaration of Helsinki, offer recommendations for conducting experiments with human subjects. We convened an Institutional Review Board (IRB) (31) that helped us establish the protocol for obtaining and using human DNA and the informed consent process used to enroll research volunteers for the DNA-sequencing studies reported here. We adopted several steps and procedures to protect the privacy rights and confidentiality of the research subjects (donors). These included a two-stage consent process, a secure random alphanumeric coding system for specimens and records, circumscribed contact with the subjects by researchers, and options for off-site contact of donors. In addition, Celera applied for and received a Certificate of Confidentiality from the Department of Health and Human Services. This Certificate authorized Celera to protect the privacy of the individuals who volunteered to be donors as provided in Section 301(d) of the Public Health Service Act 42 U.S.C. 241(d).

Celera and the IRB believed that the initial version of a completed human genome should be a composite derived from multiple donors of diverse ethnic backgrounds. Prospective donors were asked, on a voluntary basis, to self-designate an ethnogeographic category (e.g., African-American, Chinese, Hispanic, Caucasian, etc.). We enrolled 21 donors (32).

Three basic items of information from each donor were recorded and linked by confidential code to the donated sample: age, sex, and self-designated ethnogeographic group. From females, ~130 ml of whole, heparinized blood was collected. From males, ~130 ml of whole, heparinized blood was collected, as well as five specimens of semen, collected over a 6-week period. Permanent lymphoblastoid cell lines were created by Epstein-Barr virus immortalization. DNA from five subjects was selected for genomic DNA sequencing: two males and three females (one African-American, one Asian-Chinese, one Hispanic-Mexican, and two Caucasians) (see Web fig. 2 on Science Online at www.sciencemag.org/cgi/content/full/291/5507/1304/DC1). The decision of whose DNA to sequence was based on a complex mix of factors, including the goal of achieving diversity as well as technical issues such as the quality of the DNA libraries and availability of immortalized cell lines.

1.1 Library construction and sequencing

Central to the whole-genome shotgun sequencing process is preparation of high-quality plasmid libraries in a variety of insert sizes so that pairs of sequence reads (mates) are obtained, one read from both ends of each plasmid insert. High-quality libraries have an equal representation of all parts of the genome, a small number of clones without inserts, and no contamination from such sources as the mitochondrial genome and Escherichia coli genomic DNA. DNA from each donor was used to construct plasmid libraries in one or more of three size classes: 2 kbp, 10 kbp, and 50 kbp (Table 1) (33).

In designing the DNA-sequencing process, we focused on developing a simple system that could be implemented in a robust and reproducible manner and monitored effectively (Fig. 2) (34).
Current sequencing protocols are based on the dideoxy sequencing method (35), which typically yields only 500 to 750 bp of sequence per reaction. This limitation on read length has made monumental gains in throughput a prerequisite for the analysis of large eukaryotic genomes. We accomplished this at the Celera facility, which occupies about 30,000 square feet of laboratory space and produces sequence data continuously at a rate of 175,000 total reads per day. The DNA-sequencing facility is modular by design and automated. Intermodule sample backlogs allowed four principal modules to operate independently: (i) library transformation, plating, and colony picking; (ii) DNA template preparation; (iii) dideoxy sequencing reaction set-up and purification; and (iv) sequence determination with the ABI PRISM 3700 DNA Analyzer. Because the inputs and outputs of each module have been carefully matched and sample backlogs are continuously managed, sequencing has proceeded without a single day's interruption since the initiation of the Drosophila project in May 1999. The ABI 3700 is a fully automated capillary array sequencer and as such can be operated with a minimal amount of hands-on time, currently estimated at about 15 min per day. The capillary system also facilitates correct associations of sequencing traces with samples through the elimination of manual sample loading and lane-tracking errors associated with slab gels. About 65 production staff were hired and trained, and were rotated on a regular basis through the four production modules. A central laboratory information management system (LIMS) tracked all sample plates by unique bar code identifiers. The facility was supported by a quality control team that performed raw material and in-process testing and a quality assurance group with responsibilities including document control, validation, and auditing of the facility. Critical to the success of the scale-up was the validation [...]

1.2 Trace processing

[...] been developed to process each sequence [...]. After quality [...] trimming, the average trimmed sequence length was 543 bp, and the sequencing accuracy was [...] distributed with a mean of 99.5% [...] than 98% [...]. Each trimmed sequence was screened for matches to contaminants including sequences of vector alone, E. coli genomic DNA, and human mitochondrial DNA. [...] discarded. A total of 713 reads matched E. coli genomic DNA and 114 reads matched the human mitochondrial genome.

1.3 Quality assessment and control

The importance of the base-pair level accuracy of the sequence data increases as the size and repetitive nature of the genome to be sequenced increases. Each sequence read must be placed uniquely in the genome,



and even a modest error rate can reduce the effectiveness of assembly. [...]

[...] entire human genome in a single facility, we were able to ensure uniform quality standards and the cost advantages associat[...]

[...]ments to a region or chromosome on the basis [...]

Table 1. Celera-generated data input into assembly.

                                                Number of reads for different insert libraries
                                   Individual        2 kbp        10 kbp       50 kbp        Total    Total number of base pairs
No. of sequencing reads            A                     0             0    2,767,357    2,767,357                 1,502,674,851
                                   B            11,736,757     7,467,755       66,930   19,271,442                10,464,393,006
                                   C               853,819       881,290            0    1,735,109                   942,164,187
                                   D               952,523     1,046,815            0    1,999,338                 1,085,640,534
                                   F                     0     1,498,607            0    1,498,607                   813,743,601
                                   Total        13,543,099    10,894,467    2,834,287   27,271,853                14,808,616,179
Fold sequence coverage             A                     0             0         0.52         0.52
(2.9-Gb genome)                    B                  2.20          1.40         0.01         3.61
                                   C                  0.16          0.17            0         0.32
                                   D                  0.18          0.20            0         0.37
                                   F                     0          0.28            0         0.28
                                   Total              2.54          2.04         0.53         5.11
Fold clone coverage                A                     0             0        18.39        18.39
                                   B                  2.96         11.26         0.44        14.67
                                   C                  0.22          1.33            0         1.54
                                   D                  0.24          1.58            0         1.82
                                   F                     0          2.26            0         2.26
                                   Total              3.42         16.43        18.84        38.68
Insert size* (mean)                Average        1,951 bp     10,800 bp    50,715 bp
Insert size* (SD)                  Average           6.10%         8.10%       14.90%
% Mates†                           Average           74.50         80.80        75.60

*Insert size and SD are calculated from assembly of mates on contigs. †% Mates is based on laboratory tracking of sequencing runs.
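
The coverage figures in Table 1 can be checked with a few lines of arithmetic. The sketch below is illustrative only: the variable and function names are hypothetical, the 543-bp average trimmed read length and the 2.9-Gb genome size come from the text, and the clone-coverage formula (mated pairs times mean insert size, divided by genome size) is an assumed approximation that lands within about 1% of the tabulated totals rather than reproducing them exactly.

```python
GENOME_SIZE_BP = 2.9e9        # genome size assumed in Table 1
MEAN_TRIMMED_READ_BP = 543    # average trimmed read length reported in the text
PUBLIC_SHREDDED_FOLD = 2.9    # coverage contributed by the shredded public data

# Per-library totals from Table 1: (reads, mean insert size in bp, fraction of reads with a mate).
# The 10-kbp mate percentage (80.80) is a reading of a damaged table cell.
LIBRARIES = {
    "2 kbp": (13_543_099, 1_951, 0.7450),
    "10 kbp": (10_894_467, 10_800, 0.8080),
    "50 kbp": (2_834_287, 50_715, 0.7560),
}

total_reads = sum(reads for reads, _, _ in LIBRARIES.values())
total_bp = total_reads * MEAN_TRIMMED_READ_BP

# Sequence coverage: bases sequenced divided by genome size.
sequence_coverage = total_bp / GENOME_SIZE_BP

# Effective coverage once the shredded public data are added.
effective_coverage = sequence_coverage + PUBLIC_SHREDDED_FOLD


def clone_coverage(reads: int, mean_insert_bp: int, mate_fraction: float) -> float:
    """Approximate physical coverage: each mated pair spans one whole insert."""
    mated_pairs = reads * mate_fraction / 2
    return mated_pairs * mean_insert_bp / GENOME_SIZE_BP


clone_cov = {name: clone_coverage(*values) for name, values in LIBRARIES.items()}
```

The sequence coverage works out to about 5.11-fold and the effective coverage to about eightfold, matching the abstract. The much higher clone coverage of the 10- and 50-kbp libraries is what makes long-range ordering of contigs possible even at modest sequence coverage.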




Fig. 2. Flow diagram for sequencing pipeline. Samples are received, selected, and processed in compliance with standard operating procedures, with a focus on quality within and across departments. Each process has defined inputs and outputs with the capability to exchange samples and data with both internal and external entities according to defined quality guidelines. Manufacturing pipeline processes, products, quality control measures, and responsible parties are indicated and are described further in the text.

[...] and provide a comparison to the public genome
sequence, which was reconstructed largely by an independent BAC-by-BAC approach. Our assemblies effectively covered the euchromatic regions of the human chromosomes. More than [...]

[...] assemblies consist of a set of contigs that are ordered and oriented into scaffolds that are then mapped to chromosomal locations by using known markers. The contigs consist of a collection of overlapping sequence reads that provide a consensus reconstruction for a contiguous interval of the genome. Mate pairs are a central component of the assembly strategy. They are used to produce scaffolds in which the size of gaps between consecutive contigs is known with reasonable precision. This is accomplished by observing that a pair of reads, one of which is in one contig, and the other of which is in another, implies an orientation and distance between the two contigs (Fig. 3). Finally, our assemblies did not incorporate all reads into the final [...] "chaff," and typically consisted of reads from within highly repetitive regions, data from other organisms introduced through various routes as found in many genome projects, and data of poor quality or with untrimmed vector.

2.1 Assembly data sets

We used two independent sets of data for our assemblies. The first was a random shotgun data set of 27.27 million reads of average length 543 bp produced at Celera. This consisted [...] from 16 libraries [...] different donors. [...] mate pairs from a library [...] clone that has sequence from both ends. [...] the Celera trimmed sequences gave 5.1X coverage of the genome, and clone coverage was 3.42X, 16.40X, and 18.84X for the 2-, 10-, and 50-kbp libraries, respectively, for a total of 38.7X clone coverage.

The second data set was from the publicly funded Human Genome Project (PFP) and is primarily derived from BAC clones. The BAC data input to the assemblies came from a download of GenBank [...] (Table 2) totaling 4443.3 Mbp of sequence. The data for each BAC is deposited at one of four levels of completion. Phase 0 data are a set of generally unassembled sequencing reads from a very light shotgun of the BAC, typically less than 1X. Phase 1 data are unordered assemblies of contigs, which we call BAC contigs or bactigs. Phase 2 data are ordered assemblies of bactigs. Phase 3 data are complete BAC sequences. In the past 2 years the PFP has focused on a product of lower quality and completeness, but on a faster time-course, by concentrating on the production of Phase 1 data [...]

We screened the bactig sequences for contaminants by using BLAST against three data sets: (i) vector sequences in Univec core, filtered for a 25-bp match at 98% sequence identity at the ends of the sequence and a 30-bp match internal to the sequence; (ii) the [...] entries, filtered at 200 bp at 98%. Whenever [...] of vector was found within [...] these criteria we removed 2.6 Mbp of possible contaminant [...] from the Phase 3 data, 61.0 Mbp from the Phase 1 and 2 data, and 16.1 Mbp from the Phase 0 data (Table 2). This left us with a total of 4363.7 Mbp of PFP sequence data: 20% finished, 75% rough-draft (Phase 1 and 2), and 5% single sequencing reads (Phase 0). An additional 104,018 BAC end-sequence mate pairs were also downloaded and included in the data sets for both assembly processes (18).



[Figure 3 diagram. Labels: mapped scaffolds (genome, STS); scaffold (read pair (mates); gap, mean and std. dev. known); contig (consensus; reads of several haplotypes; SNPs; BAC fragments).]

Fig. 3. Anatomy of whole-genome assembly. Overlapping shredded bactig fragments (red lines) and
internally derived reads from five different individuals (black lines) are combined to produce a
contig and a consensus sequence (green line). Contigs are connected into scaffolds (red) by using
mate pair information. Scaffolds are then mapped to the genome (gray line) with STS (blue star)
physical map information.
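
As a rough illustration of how a mate pair constrains the gap between two contigs (Fig. 3), the sketch below subtracts the parts of a clone insert that lie inside each contig from the library's mean insert length. The function names, the coordinate convention, and the example numbers are assumptions made for illustration; they are not taken from the assembler itself.

```python
from statistics import mean


def gap_estimate(insert_mean, left_contig_len, left_read_start, right_read_end):
    """Estimate the gap between two contigs bridged by one mate pair.

    insert_mean      -- mean insert length of the clone library (bp)
    left_contig_len  -- length of the left contig (bp)
    left_read_start  -- start of the forward read within the left contig (bp)
    right_read_end   -- end of the reverse read within the right contig (bp)
    """
    # Portion of the insert that lies inside the left contig.
    in_left = left_contig_len - left_read_start
    # Portion of the insert that lies inside the right contig.
    in_right = right_read_end
    # What remains of the insert spans the gap (a negative value means overlap).
    return insert_mean - in_left - in_right


def bundle_gap(estimates):
    """Average several mate-pair estimates of the same gap."""
    return mean(estimates)


# A 10-kbp insert: 3,000 bp fall in the left contig and 5,500 bp in the right.
print(gap_estimate(10_000, 40_000, 37_000, 5_500))  # 1500
```

Averaging several such estimates for the same pair of contigs, together with the library's standard deviation, is what lets a gap be reported with a mean and a spread.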



2.2 Assembly strategies

Two different approaches to assembly were pursued. The first was a
whole-genome assembly process that used Celera data and the PFP data in the
form of additional synthetic shotgun data, and the second was a
compartmentalized assembly process that first partitioned the Celera and PFP
data into sets localized to large chromosomal segments and then performed
ab initio shotgun assembly on each set. Figure 4 gives a schematic of the
overall process flow.

For the whole-genome assembly, the PFP data was first disassembled or
"shredded" into a synthetic shotgun data set of 550-bp reads that form a
perfect 2X covering of the bactigs. This resulted in 16.05 million "faux"
reads that were sufficient to cover the genome 2.96X because of redundancy in
the BAC data set, without incorporating the biases inherent in the PFP
assembly process. The combined data set of 43.32 million reads (8X), and all
associated mate-pair information, were then subjected to our whole-genome
assembly algorithm to produce a reconstruction of the genome. Neither the
location of a BAC in the genome nor its assembly of bactigs was used in this
process.
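
The shredding step can be pictured with a minimal sketch: fixed-length reads are cut every half read length, so interior bases are covered twice. The function name `shred` and the handling of the last read (placed flush with the right end) are assumptions for illustration, not details given in the text.

```python
def shred(bactig, read_len=550, coverage=2):
    """Cut a bactig sequence into overlapping fixed-length "faux" reads.

    Reads start every read_len // coverage bases, so interior bases are
    covered `coverage` times. Sequences no longer than read_len yield one read.
    """
    step = read_len // coverage
    if len(bactig) <= read_len:
        return [bactig]
    reads = []
    last_start = len(bactig) - read_len
    for start in range(0, last_start, step):
        reads.append(bactig[start:start + read_len])
    # Always end with a read flush against the right end of the bactig.
    reads.append(bactig[last_start:])
    return reads


faux = shred("ACGT" * 500)  # a 2,000-bp toy bactig
print(len(faux))            # 7
```

Because the faux reads carry no record of how the bactig was originally assembled, a downstream assembler is free to rejoin them differently.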
Bactigs were shredded into reads because we found strong evidence that 2.13%
of them were misassembled (40). Furthermore, BAC location information was



N O M ^ 




• Vii- - J ".""vV • S'.v^^^ '^..-^ ::; ' r - ' bactigs^W genome locality, frornibmccxtq. 




. JO'P) 

?; Total cbhtem masked 



. Number of cohtigs/. V: 

Total base pairs : " • > ? i S ' 
r Total vector masked (bp) : > 
Total contaminant masked V 

,:.(bp) : ; 

Average cbhtig length (bp) J 

r Number of accession records 

Number "of contigs ; 
^ Total base pairs 

. . v; ! ' v- -; -^^^ (bp) : 

' . . r 7;:j^ J Total contaminiant masked r; 

"■'■■'■(bp)-.: ' "■ ' ' 

Average contig length (bp) 

Production Sequencing Nurpber 'of accession records 
: Facility, DOE Joint^;f :^:v Number.of contigs : , ^.v; . . ■ 



USA' 



■ • Baylor College of 
Medicine; USA 



vi'i r - : - . K or •*(X)mp6rieIlts^^ tJiat be; detetmiiiec^^ viiih 

^feija 829,358^i^Hei ^toV^eactf 
'^^:r^^^.2i202,^^:>bactig-^ 
^;v^iX98,028 ensufe^an independent. al) initid asseiiibly of 

.... , . . vK^^^^i * 

>';;7,853 :; :/. .134,516 rJ,. 

?i;3,232;;if M^:??>^1360^ and the^ffert ofintw^mosom^ 
t ,t# • 'v6i,iB12 ^^v^iS^iiSOO'^:^:^^^ in a 



: Genome Institute, . 
USA 



The Institute of Physical 

and Chemical 
. Research (RIKEN), 
; ' Japan — - v - 



Sanger Centre, UK 



Others* 



All centers combinedf 



■ Total base pairs . . . ^ , 
total vector masked (bjp) f ; 
Total contaminant masked • ' : * 

(bp) ^ 
Average contig length (bp) 

Number of accession records 
Number of contigs 
Total base pairs 

-Total vector masked. (bp) ^ 
Total contaminant masked (bp) 
Average contig length (bp) 
Number of accession records 
Number of contigs ^ 
Total base pairs 
Total vector masked (bp) 
Tdtal contaminant masked (bp) 
Average contig length (bp) 
Number of accession records 
Number of contigs 
Total base pairs 
Total vector masked (bp) 
Total contaminant masked 
(bp) 

Average contig length (bp) 

Number of accession records 

Number of contigs 
r Total base pairs . 
. Total vector masked (bp) . 

Total contaminant masked 
(bp) 

Average coritig length (bp) 



4Ar.;21,604i,, 
rp^'562^ 

- vO 



135 

; L 7,052 
:8J^a2U: 

Hv^: 22,644 
7^ 665,818 



V 270,942 
^;i.476.141 

^->^^.^• 9,079 

'::^::-i:"-T;626'--^ 

44,861 
265,547,066 
% 218,769- 
^ .Cl.784,70q^:, . 

5,919^ 

: 2,043^ 
34,938;- 



8,287.. 
^469,487' 



1.231 
0 

. 0 
0 
0 
0 
0 

0 
0 
0 
0 
0 
0 

42 
5,978 
5,564,879 
57,448 
575,366 

931 

3,021 
. 258,943 
209.930,983 
. : 1,655,293 
14.918.135 

?- t^^r ; 

811, 
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■Reconstruction of the genome that was relatively 
^ indep^aidenf of the whole-gehome assembly rc--/| 
: suits so that flie two assemblies could be com- ,| 
126319 - v.-pared for coiisistency. The q^^^ 
.^>363 toy tioningiinto t components;. was icnjcial . SQ:;tli;'M| 
• ^363 ^^frdifferMitvgenome . regions were Jiot mixed t(v ^ 
49,017.104 . gether; We constructed components from (i) Hic 
..;;.4.960...;,,j,n-es,,.sca£folds of thersequence from.ciicli 
v„i 485,1 37^^ BAXJ md (ii) assembled Scaffolds of data uniqiic ;h 
• n,c 033 ' : ■ to' Celeia's data set TTie BAG assemblies were 
■ . - -- ^ obtained by a combining asseinbler that used ll)c ; 

i^ii - 34.938, ^7^^ 

-■^■4MlS^--vi18387::^?.andcb^^ 

^^J^'^ . . ..stretch.-fte mofe^accurately one can t.le thcsc^ 

scaffolds into contiguous components on the basis of sequence overlap and
mate-pair information. We further visually inspected and curated the scaffold
tiling of the components to further increase its accuracy. For the final CSA
assembly, all but the partitioning was ignored, and an independent, ab initio
reconstruction of the sequence in each component was obtained by applying our
whole-genome assembly algorithm to the partitioned, relevant Celera data and
the shredded, faux reads of the partitioned, relevant bactig data.

2.3 Whole-genome assembly

The algorithms used for whole-genome assembly (WGA) of the human genome were
enhancements to those used to produce the sequence of the Drosophila genome
reported in detail in (25).

The WGA assembler consists of a pipeline composed of five principal stages:
Screener, Overlapper, Unitigger, Scaffolder, and Repeat Resolver,
respectively. The Screener finds and marks all microsatellite repeats with
less than a 6-bp element, and screens out all known interspersed repeat
elements, including Alu, Line, and ribosomal DNA. Marked regions get searched
for overlaps, whereas screened regions do not get searched, but can be part of
an overlap that involves unscreened matching segments.
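
To make the Screener's marking step concrete, here is a small sketch that finds tandem repeats whose repeating element is shorter than 6 bp. It uses a regular expression rather than whatever method the Screener actually employs, and the `min_copies` threshold is an assumed parameter that the text does not specify.

```python
import re


def mark_microsatellites(seq, max_unit=5, min_copies=4):
    """Return (start, end, unit) for tandem repeats with a unit of 1..max_unit bp."""
    # The lazy quantifier tries the shortest repeating element first; the
    # backreference then requires at least min_copies - 1 further copies.
    pattern = re.compile(r"([ACGT]{1,%d}?)\1{%d,}" % (max_unit, min_copies - 1))
    return [(m.start(), m.end(), m.group(1)) for m in pattern.finditer(seq)]


print(mark_microsatellites("GGTT" + "CA" * 6 + "TTGA"))  # [(4, 16, 'CA')]
```

The intervals returned here would be marked rather than removed, so that they can still take part in an overlap that also involves unmarked sequence.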



8.422 

1,149 
25,772 
182,812.275 
203,792^ 

308.426 

7.093 

4,538 
74.324 
689,059,692 
427.326 
2,066,305 
9,271 

1.894 
29.898 
283,358,877 
279,477 
1,616,665 

9,478 

21,015 
409,628 
3,360,047.574 
2.438,575 
16.311.664 



8.203 



80,867 

300 
300 

20,093,926 
. - 2.371 
" 27,781 



2.599 
246.118,000 
25,054 
374,561 
94,697 

3,458 
3,458 
246,474,157 
32,135 
1,791.849 

71,277 

9.137 

.9.137 

835,722,268 
82,284 
3.365,230 

91,466 



aggressive and thus more likely to make a
mistake. For the human assembly, we continued
to use the first "Rocks" substage where




end-to-end overlaps of at least 40 bp and with >^ ^ fh^ n^^i?^.f^;in^^^?§^^^^ 

no more, than 6%, differences, in the match. ; 
v Because, .all data are scraipulotisly,i 
. . trimmed/ the . Qverlappeir xan insist 

plete overlap matches; Computing ! 

■all overlaps.took roughly .10,000 CPl) hours " apthpr'^nt^^^^ "'""^^ lo- ..rwo or-'more ; mate pairs wim one^ot their 

: with a suite of fbu^rocess^'Ig^S^- SS^^ 

a.daysin,el^sedtixne;with^p^a^^^ 
operating in paiallel;C;:>>.-;V:fa;;^^^^^ 

^i:-..Eyeryoyerlap-coniUt6dS 

.^.cally^a l-in:iO"T,event and thus^oti^coinci^;^^ 
.^^ental eyent. :«^ 

,^;natorialIy..difficuUjs,thatirfiii;.iikii^ i 
.tvlap^:ai^actuany.saniledftofflSv^ 
vzitgipns of Ae. genome, and;thu^ 

:::the sequence reads.should*e:^sembl^ tb:^4^S 
.,geaieA.e.ven more overlaps are actuilly.from, ; 
.&iv^:di.inct.opies^^^^ 

. element not screened above. . thus ,constitutin| -« mee^^ 



y^time;:^usy!aImost everjr,- b^^ not dll, of the 



tiiie overlaps and the latter"repeat-induced ' :petitiye elements ,lmd o6casiohallV "to sn^l 
^^^1^::^-^^ choos..5,.equencing gaps. TT.ese scaffolds ^« 



in the process 

We achieve this objective in the Unitigger. We first find all assemblies of
reads that appear to be uncontested with respect to all other reads. We call
the contigs formed from these subassemblies unitigs (for uniquely assembled
contigs). Formally, these unitigs are the uncontested interval subgraphs of
the graph of all overlaps (^). Unfortunately, although empirically many of
these assemblies are correct (and thus involve only true overlaps), some are
in fact collections of reads from several copies of a repetitive element that
have been overcollapsed into a single subassembly. However, the overcollapsed
unitigs are easily identified because their average coverage depth is too high
to be consistent with the overall level of sequence coverage. We developed a
simple statistical discriminator that gives the logarithm of the odds ratio
that a unitig is composed of unique DNA or of a repeat consisting of two or
more copies. The discriminator, set to a sufficiently stringent threshold,
identifies a subset of the unitigs that we are certain are correct. In
addition, a second, less stringent threshold identifies a subset of remaining
unitigs very likely to be correctly assembled, of which we select those that
will consistently scaffold (see below), and thus are again almost certain to
be correct. We call the union of these two sets U-unitigs. Empirically, we
found from a 6X simulated shotgun of human chromosome 22 that we get U-unitigs
covering 98% of the stretches of unique DNA that are >2 kbp long. We are
further able to identify the boundary of the start of a repetitive element at
the ends of a U-unitig and leverage this so that U-unitigs span more than 93%
of all
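
The text does not give the discriminator's formula. One way such a log-odds score can be built, shown here purely as a sketch, is to assume that read start positions follow a Poisson process: a unique unitig then expects a certain number of reads for its span, and a collapsed two-copy repeat expects twice as many. The function name and the Poisson assumption are illustrative.

```python
import math


def unique_log_odds(unitig_span, n_reads_in_unitig, total_reads, genome_size):
    """Log-odds that a unitig is unique DNA rather than a collapsed two-copy repeat.

    Read starts are modeled as a Poisson process with rate
    total_reads / genome_size. A unique unitig of the given span expects
    rate * span reads; a two-copy repeat collapsed into the same span
    expects twice as many.
    """
    rate = total_reads / genome_size
    expected_unique = rate * unitig_span
    # ln[Poisson(k; m) / Poisson(k; 2m)] simplifies to m - k * ln(2).
    return expected_unique - n_reads_in_unitig * math.log(2)


# A 10-kbp unitig in a 27.27-million-read data set over a 2.9-Gbp genome
# expects about 94 reads if it is unique.
print(unique_log_odds(10_000, 94, 27_270_000, 2_900_000_000) > 0)   # True
print(unique_log_odds(10_000, 190, 27_270_000, 2_900_000_000) > 0)  # False
```

Two thresholds on a score of this kind would correspond to the stringent and less stringent cutoffs described above.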



genome 

. For the i)ro50/?^//a assembly, we engaged 



a read does not belong it rarely agrees with
the remainder of the reads. Therefore, we
simply assemble this set of reads within the
gap, eliminating any reads that conflict with
the assembly. This operation proved much
more reliable than the one it replaced for the



in o <.f««* X 1 . -o-o" — '^"""iv, uiou iiic uuc u rcpiacea ror me 

.;.m a,. three,^^^^ • strategy ^.vi>^A//a assembly;, in ^the assembly of a 

where,each.stage,.5Yasvprogressively:^^^ shotgun data^et of hmnan cliomo- 



[Figure 4 flowchart. Labels: 5.11X Celera reads, 39X mate pairs; public bactigs (from 33,421 BACs); bactigs and Celera pairs by BAC; Combining Assembler; WGA Assembly; CSA Assembly.]

Fig. 4. Architecture of Celera's two-pronged assembly strategy. Each oval denotes a computation
process performing the function indicated by its label, with the labels on arcs between ovals
describing the nature of the objects produced and/or consumed by a process. This figure
summarizes the discussion in the text that defines the terms and phrases used.

www.sciencemag.org SCIENCE VOL 291 16 FEBRUARY 2001 1311



THE HUMAN GENOME



... ,'',some 22, all stones .were placed coirectlyi^v.^/^.^ye.requi^ each -was 'essentially exponential. 

' -:t-^.;^*^xliie"fo of resolving" gaps :is^<t6ii;kJlAlvL By maldng thie OyerlappeVahd Unitigger^-^Moie'lh^m 'of all gaps were less than 500 

rc-cvfill them with assembled BAG data that coyer.g;*^ mcrementai;-we;were able to achieve the .'same bp lqng/>62% of all gaps were less than 1 kbp 

".'"V,:".(ithe gap.; We call this external ""'^'i'^"" " ^ ^rvr««„*o*;«« ««*t,^o -'-m^*,,, :^^a -^inn i — 

J:,.' Vi We did not include the yery 
. ■ .bles";substage'. described 'in 

>;:V^rWork,- which made^ enpu^ : 





^W.GA*vapproach,;^we pur- 

At the final'stage of &e'assembly. process/; GS 160j^ W^^^).; Th was 
: and "also^f.at ?several =|intennediate ; pk>inte y;] total ;riin;;pf .the;.asseinb^ was .ji^v-^mtended ;toj subc^^7de .'^ :seg-' 

* -ii-'^cdnsensii^jS^^ CPJJ hours!! T '{-^?^j^ • "VrK^^'^W^^^^fiW^ as-^ 

■ , . -.'"duced. Our algorithin is driven by the princi- 'The * ^assembly''' of "^C^lera's^/idate^'i to^ ;;yembled mdiyidually.^^^ We .^xpected .that ^this, 
' ' ' P^® ' of * ma5dmuni'*parsimony,^7\wth '/q^^^ the shredded bactig dat^" produced a s'^^ of .^^^^would help in^^resolution bf tege interchro-' 

r : value-weighted measures for evaluating' each X". scaffolds tolling 2.848 .Gbp in span.aiid con- T 'mpsomal duplications and improve the statis- : 
. " • r'base. The net effect is a Bayesian'estimate of !^i;sisting of 2.586 Gbp of s^uence;:The chaff, or j/: V: tics, -for calculating 
:v .[ the . correct base * to. rieport/at. each position, ^^.i j^set, , of meritalized;assembly ' 

.; ' Coiisensus generation uses Celera data when-//- ??;nunibered 1 1 .27 million (26%), which is - con- 
. ever it is present. In the event that no Celera :{;;^sistent: with oxiri'expeiia^ 
- data cover a given region, the BAC^ data. jvCMore^ti^ 

"^^ ;sequence1s used.". ■ a . vfe^v;-^ «;rWscaf^^ long;" and .these; averaged ■ 

c .' ; A key element of acbiem gaps with a , total of 

^ : human genome was to pariallelize the Overlap- ;, -v 2.297: Gbp of sequence. There were a total of 

• per and the central consensus sequencefcon-.j^j 93,857 /gaps among; Ae. 1637, scaffolds >100.v, 
. ,/ : 'jstriicting subroutines; in additi^ i^rra^^cularj^^ 
. - y a ;real issiiej-^a straightforward ] application^ of the average cbntig ' size was 24.06 kbp, ;and the^^ f . ;\entiy ■ arid thps^'thaf di Public ' 

1 -i^Hht sbftwaiie we had built for -.l-^Jayerage gap.size w^ 2.43 kbp, where 'the. dis-/.' to 



tering Celera reads and bactigs into large,
multiple megabase regions of the genome,
and then running the WGA assembler on the
Celera data and shredded, faux reads obtained
from the bactig data.

The first phase of the CSA strategy was to
separate Celera reads into those that matched


Table 3. Scaffold statistics for whole-genome and compartmentalized shotgun assemblies.

                                      Scaffold size
                                      All            >30 kbp        >100 kbp       >500 kbp       >1000 kbp

Compartmentalized shotgun assembly
No. of bp in scaffolds                2,905,568,203  2,748,892,430  2,700,489,906  2,489,357,260  2,248,689,128
  (including intrascaffold gaps)
No. of bp in contigs                  2,653,979,733  2,524,251,302  2,491,538,372  2,320,648,201  2,106,521,902
No. of scaffolds                      53,591         2,845          1,935          1,060          721
No. of contigs                        170,033        112,207        107,199        93,138         82,009
No. of gaps                           116,442        109,362        105,264        92,078         81,288
No. of gaps ≤1 kbp                    72,091         69,175         67,289         59,915         53,354
Average scaffold size (bp)            54,217         966,219        1,395,602      2,348,450      3,118,848
Average contig size (bp)              15,609         22,496         23,242         24,916         25,686
Average intrascaffold gap size (bp)   2,161          2,054          1,985          1,832          1,749
Largest contig (bp)                   1,988,321      1,988,321      1,988,321      1,988,321      1,988,321
% of total contigs                    100            95             94             87             79

Whole-genome assembly
No. of bp in scaffolds                2,847,890,390  2,574,792,618  2,525,334,447  2,328,535,466  2,140,943,032
  (including intrascaffold gaps)
No. of bp in contigs                  2,586,634,108  2,334,343,339  2,297,678,935  2,143,002,184  1,983,305,432
No. of scaffolds                      118,968        2,507          1,637          818            554
No. of contigs                        221,036        99,189         95,494         84,641         76,285
No. of gaps                           102,068        96,682         93,857         83,823         75,731
No. of gaps ≤1 kbp                    62,356         60,343         59,156         54,079         49,592
Average scaffold size (bp)            23,938         1,027,041      1,542,660      2,846,620      3,864,518
Average contig size (bp)              11,702         23,534         24,061         25,319         25,999
Average intrascaffold gap size (bp)   2,560          2,487          2,426          2,213          2,082
Largest contig (bp)                   1,224,073      1,224,073      1,224,073      1,224,073      1,224,073
% of total contigs                    100            90             89             83             77
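
The counts in Table 3 obey a simple identity that is handy as a sanity check: a scaffold built from c contigs contains c - 1 intrascaffold gaps, so within any size class the number of gaps equals the number of contigs minus the number of scaffolds. The snippet below checks this for the whole-genome assembly; the helper name is hypothetical.

```python
# (scaffolds, contigs, gaps) per scaffold-size class for the WGA, from Table 3.
wga = {
    "All":      (118_968, 221_036, 102_068),
    ">30 kbp":  (2_507, 99_189, 96_682),
    ">100 kbp": (1_637, 95_494, 93_857),
    ">500 kbp": (818, 84_641, 83_823),
}


def intrascaffold_gaps(n_scaffolds, n_contigs):
    """A scaffold with c contigs has c - 1 gaps, so gaps = contigs - scaffolds."""
    return n_contigs - n_scaffolds


for label, (scaffolds, contigs, gaps) in wga.items():
    assert intrascaffold_gaps(scaffolds, contigs) == gaps, label

# Largest size class of the WGA: 554 scaffolds and 76,285 contigs.
print(intrascaffold_gaps(554, 76_285))  # 75731
```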
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properly place a Celera read, so all reads were^ 
first masked against a library of common
repetitive elements, and only matches of at
least 40 bp to unmasked portions of the read
constituted a hit. Of Celera's 27.27 million




assembly took place, but not enough Celera
data were matched to truly assemble the 0.5X
data set represented by the typical
Phase 0 BACs. The combining assembler

Chimeric or contaminating sequence (from
another part of the genome) would not be
incorporated into the reassembly of the component
because it did not belong there. In
effect, the previous steps in the CSA process



J ort-T^ Ml- , •■wasalsb^^ 

reads, 2076. .million matched .a bactig : and^^ 'SNP. identificatio'n - >nnfim.at; J ■ ' ' ""^ Previous sieps.m.tne, A process 

; another 0;62 niillioK^ i^^,r^ch:^^ frag- 
haW'anyriaatches^-^erenbfe&el^^ 

= fied as belonging in the regicm:6f^ l>actig^s^^» gSoi2^ we 
BAG because their hiate m^tch^A ih^ u^^^ 'lc^^ !!7r^ f^?^^^ ligbt-shoN. apphed the assembler used for. WGA^ to pro- 



we'&tuhate !that 240 Mbp :of uniaue Celera ^S^^^9 ^^f^^ir^^^^^^-^A -^ - '-"^f"">>"^.^x^xo. . morc/.man=^yu.u%^^ 

- V In the next step of the CSA-process a 
vcoa.bining-^seJirt6ok:a,^:^i.S5^:W 

: - Celara reads and bactigs fbr^ BAC^t^ and S Wcif LSy^^vES,^^^^ • ^ G^of .sequence. ^TTiere, were. >a ;^tal of ; 
' poKhiced atf assembly-of the cShil^ 
; for-.that localeVTTiese Ughr^uality seque^^^ 
^ reconstmctions were a trl^L 

utility wai simply to proVi'de more reliable At th^la„f A • Vi average contig . size,, was. 23.24 ikbp^M^ 

infomiation for fte ptoses 6?S t^SJ ■ Csci^S^SAr"^^ - ^° ' ^^'"^^^ gap size was, 2.0 kbp where each 

vponents.was to determine the order and over^-^ ' ' -- ^"^^uivanous 

;vlaP;tiliii^;of:these'^ 
- scaffolds across the genome. For thisi we 
;;used Celera's 50-kbp mate-paks information,^ 
and BAC-end pairs (7^) and sequence tagged 



matching Celera reads to determine if there
are excessive pileups indicative of unscreened
repetitive elements. Wherever these
occur, reads in the repeat region whose mates
have not been mapped to consistent positions


size ranges. Consider also that more than
49% of all gaps were <500 bp long, more
than 62% of all gaps were <1 kbp, and all
gaps are <100 kbp long. Similarly, more than
73% of the sequence is in contigs >30 kbp,



areremoved/Hien all sets of mate^irs to ^ ;^^e sequence^is^ contigs;^ 30 kbp, 

bonsistentlyimply„the^sanie.^i^l^ 

■ ^"^^^^^^.^^^^ n^:of>-^)rovides^siunmary statistics:^ structure^ 
- scaffolds, we chose not to produce this tilin a ' of thtc acc^T«Ku, „ • _ i ' 



of two bactigs are bundled into a link and
weighted according to the number of mates in
the bundle. A "greedy" strategy then attempts
to order the bactigs by selecting bundles of
mate-pairs in order of their weight. A selected
mate-pair bundle can tie together two formative
scaffolds. It is incorporated to form a
single scaffold only if it is consistent with the
majority of links between contigs of the scaffold.
Once scaffolding is complete, gaps are
filled by the "Stones" strategy described
above for the WGA assembler.
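
The bundling and greedy selection can be sketched as follows. This is a deliberately simplified stand-in: mate pairs are bundled by the pair of bactigs they connect, bundles are taken heaviest first, and a bundle is accepted only if it neither closes a cycle nor gives a bactig a third neighbour. That chain constraint replaces the majority-consistency test described in the text, and `min_weight` and the function names are assumptions for illustration.

```python
from collections import Counter


def bundle_links(mate_pairs):
    """Group mate pairs by the (unordered) pair of bactigs they connect.

    mate_pairs -- iterable of (bactig_a, bactig_b) tuples, one per mate pair.
    Returns a Counter mapping frozenset({a, b}) -> number of supporting mates.
    """
    return Counter(frozenset(p) for p in mate_pairs if p[0] != p[1])


def greedy_scaffold(mate_pairs, min_weight=2):
    """Join bactigs into chains, heaviest bundle first (simplified)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    degree = Counter()  # a bactig in a chain has at most two neighbours
    accepted = []
    links = bundle_links(mate_pairs)
    # Sort by weight (descending), then by name for deterministic output.
    for link, weight in sorted(links.items(), key=lambda kv: (-kv[1], sorted(kv[0]))):
        a, b = sorted(link)
        if weight < min_weight:
            continue
        # Reject links that would close a cycle or branch an existing chain.
        if find(a) == find(b) or degree[a] == 2 or degree[b] == 2:
            continue
        parent[find(a)] = find(b)
        degree[a] += 1
        degree[b] += 1
        accepted.append((a, b, weight))
    return accepted


pairs = [("A", "B")] * 5 + [("B", "C")] * 3 + [("A", "C")] * 2 + [("C", "D")]
print(greedy_scaffold(pairs))  # [('A', 'B', 5), ('B', 'C', 3)]
```

Taking heavy bundles first means that well-supported joins are fixed before weaker, possibly repeat-induced links are considered.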

The GenBank data for the Phase 1 and 2
BACs consisted of an average of 19.8 bactigs
per BAC of average size 8099 bp. Application
of the combining assembler resulted in
individual Celera BAC assemblies being put
together into an average of 1.83 scaffolds
(median of 1 scaffold) consisting of an average
of 8.57 contigs of average size 18,973 bp.
In addition to defining order and orientation
of the sequence fragments, there were 57%
fewer gaps in the combined result. For Phase
0 data, the average GenBank entry consisted
of 91.52 reads of average length 784 bp.
Application of the combining assembler resulted
in an average of 54.8 scaffolds consist-


scaffolds, we chose not to produce this tiling
in a fully automated manner, but to compute
an initial tiling with a good heuristic and then
use human curators to resolve discrepancies
or missed join opportunities. To this end, we
developed a graphical user interface that displayed
the graph of tiling overlaps and the
evidence for each. A human curator could
then explore the implication of mapped STS
data, dot-plots of sequence overlap, and a
visual display of the mate-pair evidence supporting
a given choice. The result of this
process was a collection of "components,"
where each component was a tiled set of
BAC and Celera-unique scaffolds that had
been curator-approved. The process resulted
in 3845 components with an estimated span
of 2.922 Gbp.

In order to generate the final CSA, we
assembled each component with the WGA
algorithm. As was done in the WGA process,
the bactig data were shredded into a synthetic
2X shotgun data set in order to give the
assembler the freedom to independently assemble
the data. By using faux reads rather
than bactigs, the assembly algorithm could
correct errors in the assembly of bactigs and
remove chimeric content in a PFP data entry.

size 873 bp. Basically, some small amount of



of this assembly with a direct comparison to
the WGA assembly.

2.5 Comparison of the WGA and CSA scaffolds

Having obtained two assemblies of the human genome via independent
computational processes (WGA and CSA), we compared scaffolds from the two
assemblies as another means of investigating their completeness, consistency,
and contiguity. From each assembly, a set of reference scaffolds containing at
least 1000 fragments (Celera sequencing reads or bactig shreds) was obtained;
this amounted to 2218 WGA scaffolds and 1717 CSA scaffolds, for a total of
2.087 Gbp and 2.474 Gbp. The sequence of each reference scaffold was compared
to the sequence of all scaffolds from the other assembly with which it shared
at least 20 fragments or at least 20% of the fragments of the smaller
scaffold. For each such comparison, all matches of at least 200 bp with at
most 2% mismatch were tabulated.
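
The selection rules in this paragraph translate directly into two small predicates. Representing each scaffold as a set of fragment identifiers is an assumption made for this sketch, as are the function names.

```python
def is_reference(fragments, min_fragments=1000):
    """Reference scaffolds contain at least 1000 fragments."""
    return len(fragments) >= min_fragments


def should_compare(ref_fragments, other_fragments, min_shared=20, min_fraction=0.20):
    """Decide whether two scaffolds share enough fragments to be compared.

    Both arguments are sets of fragment identifiers (reads or bactig shreds).
    The pair qualifies with at least 20 shared fragments, or with at least
    20% of the fragments of the smaller scaffold.
    """
    shared = len(ref_fragments & other_fragments)
    smaller = min(len(ref_fragments), len(other_fragments))
    return shared >= min_shared or (smaller > 0 and shared >= min_fraction * smaller)


ref = set(range(1500))
print(is_reference(ref))                             # True
print(should_compare(ref, set(range(1490, 1530))))   # True: 10 of 40 shared
print(should_compare(ref, set(range(1495, 2000))))   # False: 5 of 505 shared
```

The fractional rule lets small scaffolds qualify even when they cannot reach 20 shared fragments in absolute terms.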

From this tabulation, we estimated the amount of unique sequence in each
assembly in two ways. The first was to determine the number of bases of each
assembly that were not covered by a matching segment in the other assembly.
Some 82.5 Mbp of the WGA (3.95%) was not covered by the CSA, whereas 204.5 Mbp
(8.26%) of the CSA was not covered by the WGA. This estimate did not require
any consistency of the assemblies or any uniqueness of the matching segments.
Thus, another analysis was conducted in which matches of less than 1 kbp
between a pair of scaffolds were excluded unless they were confirmed by other
matches having a consistent order and orientation. This gives some measure of
consistent coverage: 1.982 Gbp (95.00%) of the WGA is covered by the CSA, and
2.169 Gbp (87.69%) of the CSA is covered by the WGA by this more stringent
measure.

The comparison of WGA to CSA also permitted evaluation of scaffolds for
structural inconsistencies. We looked for instances in which a large section
of a scaffold from one assembly matched only one scaffold from the other
assembly, but failed to match over the full length of the overlap implied by
the matching segments. An initial set of candidates was identified
automatically, and then each candidate was inspected by hand. From this
process, we identified 31 instances in which the assemblies appear to disagree
in a nonlocal fashion. These cases are being further evaluated to determine
which assembly is in error and why.

In addition, we evaluated local inconsistencies of order or orientation. The
following results exclude cases in which one contig in one assembly
corresponds to more than one overlapping contig in the other assembly (as long
as the order and orientation of the latter agrees with the positions they
match in the former). Most of these small rearrangements involved segments on
the order of hundreds of base pairs and rarely >1 kbp. We found a total of
295 kbp (0.012%) in the CSA assemblies that were locally inconsistent with the
WGA assemblies, whereas 2.108 Mbp (0.11%) in the WGA assembly were
inconsistent with the CSA assembly.




The CSA assembly was a few percentage points better in terms of coverage and
slightly more consistent than the WGA, because it was in effect performing a
few thousand shotgun assemblies of megabase-sized problems, whereas the WGA is
performing a shotgun assembly of a gigabase-sized problem. When one considers
the increase of two-and-a-half orders of magnitude in problem size, the
information loss between the two is remarkably small. Because CSA was
logistically easier to deliver and the better of the two results available at
the time when downstream analyses needed to be begun, all subsequent analysis
was performed on this assembly.

2.6 Mapping scaffolds to the genome

The final step in assembling the genome was to order and orient the scaffolds
on the chromosomes. We first grouped scaffolds together on the basis of their
order in the components from CSA. These grouped scaffolds were reordered by
examining residual mate-pairing data between the scaffolds. We next mapped the
scaffold groups onto the chromosome using physical mapping data. This step
depends on having reliable high-resolution map information such that each
scaffold will overlap multiple markers. There are two genome-wide types of map
information available: high-density STS maps and fingerprint maps of BAC
clones developed

: at Washington University. (^^^5^ 

nome-wide; STS ,rniaps,::Qen^^ : (GM99) 

has the most markers and therefore was most
useful for mapping scaffolds. The two different
mapping approaches are complementary to one
another. The fingerprint maps should have better
local order because they were built by comparison
of overlapping BAC clones. On the
other hand, GM99 should have a more reliable
long-range order, because the framework markers
were derived from well-validated genetic
maps. Both types of maps were used as a
reference for human curation of the components
that were the input to the regional assembly,
but they did not determine the order of
sequences produced by the assembler.



[Figure 5 bar chart. x-axis: scaffold size (<30 kb, 30-50 kb, 50-100 kb, 100-500 kb, 0.5-1 Mb, 1-5 Mb, 5-10 Mb, >10 Mb); y-axis: percent of total sequence (0% to 50%).]

Fig. 5. Distribution of scaffold sizes of the CSA. For each range of scaffold sizes, the percent of total
sequence is indicated.



In order to determine the effectiveness of the fingerprint maps and GM99 for
mapping scaffolds, we first examined the reliability of these maps by
comparison with large scaffolds. Only 1% of the STS markers on the 10 largest
scaffolds (those >9 Mbp) were mapped on a different chromosome on GM99. Two
percent of the STS markers disagreed in position by more than five framework
bins. However, for the fingerprint maps, a 2% chromosome discrepancy was
observed, and on average 23.8% of BAC locations in the scaffold sequence
disagreed with fingerprint map placement by more than five BACs. When further
examining the source of discrepancy, it was found that most of the discrepancy
came from 4 of the 10 scaffolds, indicating that there is variation in the
quality of either the map or the scaffolds. All four scaffolds were assembled
as well as the other six, as judged by clone coverage analysis, and showed the
same low discrepancy rate to GM99, and thus we concluded that the fingerprint
map global order in these cases was not reliable. Smaller scaffolds had a
higher discordance rate with GM99 (4.21% of STSs were discordant by more than
five framework bins), but a lower discordance rate with the fingerprint maps
(11% of BACs disagreed with fingerprint maps by more than five BACs). This
observation agrees with the clone coverage analyses (^) that Celera scaffold
construction was better supported by long-range mate pairs in larger scaffolds
than in small scaffolds.

We created two orderings of Celera scaffolds on the basis of the
markers (BAC or STS) on these maps. Where the order of scaffolds
agreed between GM99 and the WashU BAC map, we had a high degree of
confidence that that order was correct; these scaffolds were termed
"anchor scaffolds." Only scaffolds with a low overall discrepancy
rate with both maps were considered anchor scaffolds. Scaffolds in
GM99 bins were allowed to permute in their order to match WashU
ordering, provided they did not violate their framework orders.
Orientation of individual scaffolds was determined by the presence of
multiple mapped markers with consistent order. Scaffolds with only
one marker have insufficient information to assign orientation. We
found 70.1% of the genome in anchored scaffolds, more than 99% of
which are also oriented (Table 4). Because GM99 is of lower
resolution than the WashU map, a number of scaffolds without STS
matches could be ordered relative to the anchored scaffolds because
they included sequence from the same or adjacent BACs on the WashU
map. On the other hand, because of occasional WashU global ordering
discrepancies, a number of scaffolds determined to be "unmappable" on
the WashU map could be ordered relative to the anchored scaffolds
with GM99. These scaffolds were termed "ordered scaffolds." We found
that 13.9% of the assembly could be ordered by these additional
methods, and thus 84.0% of the genome was ordered unambiguously.
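The percentages quoted above can be checked against the scaffold lengths listed in Table 4. The following is a minimal sketch, not part of the original analysis; the dictionary and function names are illustrative, and the lengths are the Table 4 values read with commas as thousands separators.

```python
# Scaffold lengths in bp, as listed in Table 4 of this article.
TABLE4_LENGTH_BP = {
    "anchored": 1_860_676_676,
    "ordered": 369_235_857,
    "bounded": 368_753_463,
    "unmapped": 55_313_737,
}


def category_percentages(lengths):
    """Return each category's share of the total assembled length, in percent."""
    total = sum(lengths.values())
    return {name: 100.0 * bp / total for name, bp in lengths.items()}


pct = category_percentages(TABLE4_LENGTH_BP)
ordered_unambiguously = pct["anchored"] + pct["ordered"]

for name, value in pct.items():
    print(f"{name:9s} {value:5.1f}%")
print(f"anchored + ordered: {ordered_unambiguously:.1f}%")
```

Under this reading of the table, the anchored and ordered shares round to the 70.1% and 13.9% given in the text.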
Next, all scaffolds that could be placed, but not ordered, between
anchors were assigned to the interval between the anchored scaffolds
and were deemed to be "bounded" between them. For example, small
scaffolds having STS hits from the same GeneMap bin or hitting
[illegible] ordered relative to each [illegible] can be assigned a
placement boundary relative to other anchored scaffolds. The
remaining scaffolds either had no localization information
[illegible] or could only be assigned to a generic chromosome
location. Using the above approaches, ~98% of the genome was
anchored, ordered, or bounded.

Finally, we assigned a location for each scaffold placed on the
chromosome by spreading out the scaffolds per chromosome. We assumed
that the remaining unmapped scaffolds, constituting 2% of the genome,
were distributed evenly across the genome. By dividing the sum of
unmapped scaffold lengths by the number of mapped scaffolds, we
arrived at an estimate of interscaffold gap of 1483 bp. This gap was
used to separate all the scaffolds on each chromosome and to assign
an offset in the chromosome.
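The placement step described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, integer division is an arbitrary rounding choice, and the numbers are toy values rather than the article's data.

```python
def interscaffold_gap(unmapped_lengths, n_mapped_scaffolds):
    """Average gap: total unmapped length divided by the number of mapped scaffolds."""
    return sum(unmapped_lengths) // n_mapped_scaffolds


def assign_offsets(scaffold_lengths, gap):
    """Place scaffolds left to right on one chromosome, separated by a fixed gap."""
    offsets = []
    position = 0
    for length in scaffold_lengths:
        offsets.append(position)
        position += length + gap
    return offsets


# Toy numbers, not the article's data.
gap = interscaffold_gap(unmapped_lengths=[1200, 800, 1000], n_mapped_scaffolds=3)
offsets = assign_offsets([5000, 2000, 7000], gap)
print(gap, offsets)
```

Each scaffold's offset is the running sum of the preceding scaffold lengths plus one gap per preceding scaffold.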
During the scaffold-mapping effort, we encountered many problems that
resulted in additional quality assessment and validation analysis. At
least 978 (3% of 33,173) BACs were believed to have sequence data
from more than one location in the genome (47). This is consistent
with the bactig chimerism analysis reported above in the Assembly
Strategies section. These BACs could not be assigned to unique
positions within the CSA assembly and thus could not be used for
ordering scaffolds. Likewise, it was not always possible to assign
STSs to unique locations in the assembly because of genome
duplications, repetitive elements, and pseudogenes.

Because of the time required for an exhaustive search for a perfect
overlap, CSA generated 21,607 intrascaffold gaps where the mate-pair
data suggested that the contigs should overlap, but no overlap was
found. These gaps were defined as a fixed 50 bp in length and make up
18.6% of the total 116,442 gaps in the CSA assembly.

We chose not to use the order of exons implied in cDNA or EST data as
a way of ordering scaffolds. The rationale for not using this data
was that doing so would have biased certain regions of the assembly
by rearranging scaffolds to fit the transcript data and made
validation of both the assembly and gene definition processes more
difficult.



Assembly and validation analysis

We analyzed the assembly of the genome from the perspectives of
completeness (amount of coverage of the genome) and correctness (the
structural accuracy of the order and orientation and [illegible]
sequence of the assembly).

Completeness. Completeness [illegible] the percentage of the
[illegible] represented in the assembly. [illegible] cannot be
[illegible] euchromatin [illegible]. However, it [illegible]
completeness on the basis of (i) [illegible] intrascaffold gaps; (ii)
[illegible] published chromosomes [illegible]; and (iii) analysis
[illegible] independent set of random sequences (STS markers)
contained in the assembly. [illegible] whole-genome libraries
[illegible] been made to assemble it, there may be instances of
unique sequence embedded in regions of heterochromatin as were
observed in Drosophila (50, 51).
The sequences of human chromosomes 21 and 22 have been completed to
high quality and published (48, 49). Although this sequence served as
input to the assembler, the finished sequence was shredded into a
shotgun data set so that the assembler had the opportunity to
assemble it differently from the original sequence in the case of
structural polymorphisms or assembly errors in the BAC data. In
particular, the assembler [illegible] be able to resolve [illegible]
scale of components (generally multimegabase in size), and so this
comparison reveals the level to which the assembler resolves repeats.
In certain areas, the assembly structure differs from the published
versions of chromosomes 21 and 22 (see below). The consequence of the
flexibility to assemble "finished" sequence differently on the basis
of Celera data resulted in an assembly with more segments than the
chromosome 21 and 22 sequences. We examined the reasons why there are
more gaps in the Celera sequence than in chromosomes 21 and 22 and
expect that they may be typical of gaps in other regions of the
genome. In the Celera assembly, there are 25 scaffolds, each
containing at least 10 kb of sequence, that collectively span 94.3%
of chromosome 21. Sixty-two scaffolds span 95.7% of chromosome 22.
The total length of the gaps remaining in the Celera assembly for
these two chromosomes is 3.4 Mbp. These gap sequences were analyzed
by RepeatMasker and by searching against the entire genome assembly
(52). About 50% of the gap sequence consisted of common repetitive
elements identified by RepeatMasker; more than half of the remainder
was lower copy number repeat elements.
A more global way of assessing completeness is to measure the content
of an independent set of sequence data in the assembly. We compared
48,938 STS markers from Genemap99 ([illegible]) to the scaffolds.
Because these markers were not used in the assembly, they provided an
independent measure of completeness. ePCR (53) and BLAST
([illegible]) were used to locate STSs on the assembled genome. We
found [illegible] (91%) of the STSs in the mapped genome. An
additional [illegible] markers (5.4%) were found [illegible]. A
further [illegible] STS markers (2.6%) [illegible] may not be of
human origin. [illegible] the Celera assembled sequence [illegible]
93.4% of the [illegible] and the unassembled data 5.5%, for a total
of 98.9% coverage. Similarly, we compared [illegible] 36,678 TNG
radiation hybrid markers (55a) using the same method. We found that
32,371 markers (88%) were located in the mapped CSA scaffolds, with
[illegible] markers (5.6%) found in the remainder. This gave a 94%
coverage of the genome through another genome-wide survey.
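The marker-based coverage figures are simple proportions. As a small worked check, assuming only the two TNG counts that are legible above (the helper name is illustrative):

```python
def percent_found(found, total):
    """Share of an independent marker set located in the assembly, in percent."""
    return 100.0 * found / total


tng_total = 36_678                # TNG radiation hybrid markers compared
tng_in_mapped_scaffolds = 32_371  # markers located in mapped CSA scaffolds

mapped_pct = percent_found(tng_in_mapped_scaffolds, tng_total)
# 5.6 is the percentage quoted in the text for markers found in the remainder.
total_pct = mapped_pct + 5.6

print(round(mapped_pct), round(total_pct))
```

The two rounded values correspond to the 88% and 94% quoted in the paragraph above.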

Correctness. Correctness is defined as the structural and sequence
accuracy of the assembly. Because the source sequences for the Celera
data and the GenBank data are from different individuals, we could
not directly compare the consensus sequence of the assembly against
other finished sequence for determining sequencing accuracy at the
nucleotide level, although this has been done for identifying
polymorphisms as described in Section 6. The accuracy of the
consensus sequence is at least 99.96% on the basis of a statistical
estimate derived from the quality values of the underlying reads.

Table 4. Summary of scaffold mapping. Scaffolds were mapped to the
genome with different levels of confidence (anchored scaffolds have
the highest confidence; unmapped scaffolds have the lowest). Anchored
scaffolds were consistently ordered by the WashU BAC map and GM99.
Ordered scaffolds were consistently ordered by at least one of the
following: the WashU BAC map, GM99, or component tiling path. Bounded
scaffolds had order conflicts between at least two of the external
maps, but their placements were adjacent to a neighboring anchored or
ordered scaffold. Unmapped scaffolds had, at most, a chromosome
assignment. The scaffold subcategories are given below each category.

Mapped scaffold category   Number     Length (bp)      % Total length
Anchored                    1,526     1,860,676,676    70
  Oriented                  1,246     1,852,088,645    70
  Unoriented                  280         8,588,031    0.3
Ordered                     2,001       369,235,857    14
  Oriented                    839       329,633,166    12
  Unoriented                1,162        39,602,691    2
Bounded                    38,241       368,753,463    14
  Oriented                  7,453       274,536,424    10
  Unoriented               30,788        94,217,039    4
Unmapped                   11,823        55,313,737    2
  Known chromosome            281         2,505,844    0.1
  Unknown chromosome       11,542        52,807,893    2

The structural consistency of the assembly can be measured by
mate-pair analysis. In a correct assembly, every mated pair of
sequencing reads should be located on the consensus sequence with the
correct separation and orientation between the pairs. A pair is
termed "valid" when the reads are in the correct orientation and the
distance between them is within the mean ± [illegible] standard
deviations of the distribution of insert sizes of the library from
which the pair was sampled. A pair is termed "misoriented" when the
reads are not correctly oriented, and is termed "misseparated" when
the distance between the reads is not in the correct range but the
reads are correctly oriented. The mean ± the standard deviation of
each library used by the assembler was determined as described above.
To validate these, we examined all reads mapped to the finished
sequence of chromosome 21 (48) and determined how many incorrect mate
pairs there were as a result of laboratory tracking errors and
chimerism (two different segments of the genome cloned into the same
plasmid), and how tight the distribution of insert sizes was for
those that were correct (Table 5). The standard deviations for all
Celera libraries were quite small, less than 15% of the insert
length, with the exception of a few [illegible] libraries. The 2- and
10-kbp libraries contained less than 2% invalid mate pairs, whereas
the 50-kbp libraries were [illegible] higher (~10%). Thus, although
the mate-pair information was not perfect, its accuracy was such that
measuring valid, misoriented, and misseparated pairs with respect to
a given assembly was deemed [illegible] for validation purposes,
[illegible] several mate pairs confirm or deny an ordering.
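The three mate-pair categories defined above can be expressed as a small classifier. This is a sketch rather than the assembler's implementation: the class and function names are hypothetical, and the width of the accepted window (`k` standard deviations) is an assumed parameter, because the multiplier is not legible in this copy.

```python
from dataclasses import dataclass


@dataclass
class Library:
    mean_insert: float  # mean insert size in bp
    sd_insert: float    # standard deviation of insert size in bp


def classify_mate_pair(separation, correctly_oriented, library, k=3.0):
    """Label one mate pair as 'valid', 'misoriented', or 'misseparated'.

    k is an assumed number of standard deviations allowed around the mean.
    """
    if not correctly_oriented:
        return "misoriented"
    low = library.mean_insert - k * library.sd_insert
    high = library.mean_insert + k * library.sd_insert
    if low <= separation <= high:
        return "valid"
    return "misseparated"


# Mean and SD taken from library 1 (2 kbp) in Table 5.
lib = Library(mean_insert=2081, sd_insert=106)
print(classify_mate_pair(2100, True, lib))
print(classify_mate_pair(5000, True, lib))
print(classify_mate_pair(2100, False, lib))
```

Orientation is tested first, so a pair that is both misoriented and at the wrong distance is counted as misoriented, matching the definitions in the text.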
The clone coverage [illegible] 39X, meaning that any given base pair
was, on average, contained in 39 clones or, equivalently, spanned by
39 mate-paired reads. Areas of low clone coverage or areas with a
high proportion of invalid mate pairs would indicate potential
assembly [illegible]. We computed the coverage of each base in the
assembly by valid mate pairs (Table 6). In summary, for scaffolds
>30 kbp in length, less than 1% of the Celera assembly was in regions
of less than 3X clone coverage. Thus, more than 99% of the assembly,
including order and orientation, is strongly supported by this
measure alone.

We examined the locations and number of all misoriented and
misseparated [illegible]. In addition to doing this [illegible]
assembly (as of [illegible]), [illegible] performed a study of the
PFP assembly as of 5 September 2000 [illegible]. In this latter case,
Celera mate pairs [illegible] to be mapped to the PFP assembly. To
avoid mapping errors due to high-fidelity repeats, the only pairs
mapped were those for which both reads matched at only one location,
with less than 6% differences. [illegible] was set such that sets of
five or more simultaneously invalid mate pairs indicated [illegible]
where the construction [illegible] differed. The graphic [illegible]
CSA chromosome 21 assembly [illegible] sequence (Fig. 6A) serves as a
validation of this methodology. Blue tick marks in the panels
indicate breakpoints. [illegible] a similar (small) number of
breakpoints on both chromosome sequences. [illegible] was 12 sets of
scaffolds in the Celera assembly (a total of 3% of the chromosome
length in 212 single-contig scaffolds) that were mapped to the wrong
positions because they were too small to be mapped reliably. Figures
6 and 7 and Table 6 illustrate the mate-pair differences and
breakpoints between the two assemblies. There was a higher percentage
of misoriented and misseparated mate pairs in the large-insert
libraries (50 kbp and BAC ends) than in the small-insert libraries in
both assemblies (Table 6). The large-insert libraries are more likely
to identify discrepancies simply because they span a [illegible] the
genome. [illegible] between the two assemblies for chromosome 8
(Fig. 6, B and C) shows that there are many



Table 5. Mate-pair validation. Celera fragment sequences were mapped
to the published sequence of chromosome 21. Each mate pair uniquely
mapped was evaluated for correct orientation and placement (number of
mate pairs tested). If the two mates had incorrect relative
orientation or placement, they were considered invalid (number of
invalid mate pairs).

                  ------------------ Chromosome 21 ------------------------   -------- Genome ---------
Library  Library  Mean insert  SD      SD/mean  No. of mate   No. of invalid  %        Mean insert  SD      SD/mean
type     no.      size (bp)    (bp)    (%)      pairs tested  mate pairs      invalid  size (bp)    (bp)    (%)
2 kbp     1         2,081        106    5.1       3,642          38            1.0       2,082         90    4.3
          2         1,913        152    7.9      28,029         413            1.5       1,923        118    6.1
          3         2,166        175    8.1       4,405       [illegible]      1.3       2,162        158    7.3
10 kbp    4        11,385        851    7.5       4,319          80            1.9      11,370        696    6.1
          5        14,523      1,875   12.9       7,355         156            2.1      14,142      1,402    9.9
          6         9,635      1,035   10.7       5,573         109            2.0       9,606        934    9.7
          7        10,223        928    9.1      34,079         399            1.2      10,190        777    7.6
50 kbp    8        64,888      2,747    4.2          16           1            6.3      65,500      5,504    8.4
          9        53,410      5,834   10.9         914         170           18.6      53,311      5,546   10.4
         10        52,034      7,312   14.1       5,871         569            9.7      51,498      6,588   12.8
         11        52,282      7,454   14.3       2,629         213            8.1      52,282      7,454   14.3
         12        46,616      7,378   15.8       2,153         215           10.0      45,418      9,068   20.0
         13        55,788     10,099   18.1       2,244       [illegible]     11.1      53,062     10,893   20.5
         14        39,894      5,019   12.6         199           7            3.5      36,838      9,988   27.1
BES      15        48,931      9,813   20.1         144          10            6.9      47,845      4,774   10.0
         16        48,130      4,232    8.8         195          14            7.2      47,924      4,581    9.6
         17       106,027     27,778   26.2         330          16            4.8     152,000     26,600   17.5
         18       160,575     54,973   34.2         155           8            5.2     161,750     27,000   16.7
         19       164,155     19,453   11.9         642          44            6.9     176,500     19,500   11.05
Sum                                             102,894       2,768            2.7 (mean = 2.7)
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ISpoint map (blue tick, marl^^ft^^ 
I ? - of each chromosome m a side-oy 

1% >lio ?kbp) in both- assemblies , as rfd ttc^^ 
pSSsSe GSA^ass^ly;^|K5^.?;^. 



Summary. To enumerate the gene inventory [illegible] to other
proteins. A comparison of Otto [illegible] Genscan, a standard
gene-prediction algorithm, showed greater sensitivity ([illegible])
and specificity (0.93 versus [illegible]) [illegible] the ability to
define gene structure. [illegible] with a set of genes from three
gene-[illegible] programs that exhibited weaker, but still
significant, evidence that they may be expressed. Conservative
criteria, requiring at [illegible]. Extensive manual curation to
establish precise characterization of genes [illegible] necessary to
improve the results from [illegible] initial computational approach.



3.1 Automated gene annotation

A gene is a locus of cotranscribed exons. [illegible] multiple
functions, by means of alternative splicing and alternative
[illegible] initiation and termination sites. [illegible] the genomic
DNA the [illegible] transcript [illegible] together exons separated
by a few or [illegible] of thousands of base pairs. The [illegible]
(56). More recent data from [illegible] sectors, [illegible]
density-based extrapolations, have [illegible]. The highest recent
[illegible] from Incyte Pharmaceuticals, and is [illegible]
association of ESTs with CpG [illegible]. In stark contrast are three
quite different, and much lower estimates: one of [illegible] derived
with genome-wide EST [illegible] with a comparative [illegible]
involving [illegible] and predicted genes in the 67 [illegible]
chromosomes 21 and 22, to the approximately 3-Gbp euchromatic genome.

The problem of computational identification of transcriptional units
in genomic DNA sequence can be divided into two phases. The first is
to partition the sequence into segments that are likely to correspond
to individual genes. This is not trivial and is a weakness of most de
novo gene-finding algorithms. It is also critical to determining the
number of genes in the human gene inventory. The second challenge is
to construct a gene model that reflects the probable [illegible]
transcript(s) encoded in the region. This can


be done with reasonable accuracy when a full-length cDNA has been
sequenced or a highly homologous protein sequence [illegible] less
accurate, is the only way to find genes that are not represented by
homologous proteins or ESTs. The following section [illegible].

[illegible] annotator [illegible] identify [illegible] evidence
provided by the [illegible] how various types of evidence relate to
one another. [illegible] types of evidence and looks for [illegible]
types of evidence [illegible] annotation. For example, [illegible]
examine homology to a number of [illegible] evaluate whether or not
they can be [illegible] into a longer, virtual mRNA. [illegible]
would also evaluate [illegible] and contiguity of the [illegible], in
essence asking whether any ESTs cross [illegible] junctions and
whether [illegible] putative exons have consensus splice sites.

[illegible] evidence [illegible] annotation in one of two ways.
First, if the evidence includes a high-quality match to the sequence
of a known gene [here defined as a human gene represented in a
curated subset of the RefSeq [illegible] a gene annotation. In the
second method, Otto evaluates a broad spectrum of evidence and
determines if this evidence is adequate to support promotion to a
gene annotation. These processes are described below.

Initially, gene boundaries are predicted on the basis of examination
of sets of overlapping protein and EST matches generated by a
computational pipeline (52). This pipeline searches the scaffold
sequences against protein, EST, and genome-sequence databases to
define regions of sequence similarity and runs [illegible] de novo
gene-prediction programs.

To identify likely gene boundaries, regions of the genome were
partitioned by Otto on the basis of sequence matches identified by
BLAST. Each of the database sequences matched in the region under
analysis [illegible] as well as [illegible] protein, EST, and
[illegible] used to group the matches into bins of related sequences
that may define a gene and identify gene boundaries. During this
process, multiple hits to the same region were collapsed to a
coherent set of data by tracking the coverage of a region. For
example, if a group of bases was represented by multiple overlapping
ESTs, the union of these regions matched by the set of ESTs on the
scaffold was marked as being supported by EST evidence. This resulted
in a series of "gene bins," each of which was believed to contain a
single gene. One weakness of this initial implementation of the
algorithm was in predicting gene boundaries in regions of tandemly
duplicated genes. Gene clusters frequently resulted in homologous
neighboring genes being joined together, resulting in an annotation
that artificially concatenated these gene models.

Table 6. [caption illegible]

                 ----------------- CSA -----------------    ----------------- PFP -----------------
Genome library   % valid   % misoriented   % misseparated†   % valid   % misoriented   % misseparated†
2 kbp             98.5       0.6             1.0              95.7       2.0             2.3
10 kbp            96.7       1.0             2.3              81.9       9.6             8.6
50 kbp            93.9       4.5             1.5              64.2      22.3            13.5
BES               94.1       2.1             3.8              62.0      19.3            18.8
Mean              97.4       1.0             1.6              87.3       6.8             5.9

†Mates are [illegible].
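The "union of these regions" step, in which overlapping hits of one evidence type are collapsed into covered intervals, is an interval-merging operation. A minimal sketch follows, assuming half-open (start, end) scaffold coordinates; the function name and the coordinate convention are illustrative and are not taken from the Otto system.

```python
def union_of_hits(hits):
    """Merge overlapping or touching half-open (start, end) hits into their union."""
    merged = []
    for start, end in sorted(hits):
        if merged and start <= merged[-1][1]:
            # Extend the current covered interval.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Three hypothetical EST hits on one scaffold; the first two overlap.
est_hits = [(200, 400), (100, 250), (900, 1000)]
print(union_of_hits(est_hits))
```

Sorting first means each hit only needs to be compared with the most recent merged interval.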
Next, known genes (those with exact matches of a full-length cDNA
sequence to the genome) were identified, and the region corresponding
to the cDNA was annotated as a predicted transcript. A subset of the
curated human gene set RefSeq from the National Center for
Biotechnology Information (NCBI) was included as a data set searched
in the computational pipeline. If a RefSeq transcript matched the
genome assembly for at least 50% of its length at >92% identity, then
the SIM4 (63) alignment of the RefSeq transcript to the region of the
genome under analysis was promoted to the status of an Otto
annotation. Because the genome sequence has gaps and sequence errors
such as frameshifts, it was not always possible to predict a
transcript that agrees precisely with the experimentally determined
cDNA sequence. A total of 6538 genes in our inventory were identified
and transcripts predicted in this way.
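The promotion rule for RefSeq matches amounts to two thresholds. A sketch under stated assumptions (the function and parameter names are hypothetical; only the 50% length and 92% identity thresholds come from the text):

```python
def promote_refseq_alignment(aligned_length, transcript_length, percent_identity,
                             min_fraction=0.50, min_identity=92.0):
    """Return True if a RefSeq alignment meets the promotion thresholds:
    at least 50% of the transcript length aligned, at more than 92% identity."""
    fraction_aligned = aligned_length / transcript_length
    return fraction_aligned >= min_fraction and percent_identity > min_identity


print(promote_refseq_alignment(1500, 2000, 98.5))  # long, high-identity match
print(promote_refseq_alignment(800, 2000, 99.0))   # too little of the transcript
print(promote_refseq_alignment(1500, 2000, 92.0))  # identity not above 92%
```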
Regions that have a substantial amount of sequence similarity, but do
not match known genes, were analyzed by that part of the Otto system
that uses the sequence similarity information to predict a
transcript. Here, Otto
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Fig. 6. Comparison of the CSA and the PFP assembly. 
(A) All of chromosome 21, (B) all of chromosome 8. 
and (C) a 1-Mb region of chromosome 8 representing 
a single Celera scaffold. To generate the figure, Celera 
fragment sequences were mapped onto each assem- 
bly. The PFP assembly is indicated in the upper third 
of each panel; the Celera assembly is indicated in the 
lower third. In the center of the panel, green lines 
show Celera sequences that are in the same order and 
orientation in both assemblies and form the longest 
consistently ordered run of sequences. Yellow lines 
indicate sequence blocks that are in the same orien- 
tation, but out of order. Red lines indicate sequence 
blocks that are not in the same orientation. For 
clarity, in the latter two cases, lines are only drawn 
between segments of matching sequence that are at 
least 50 kbp long. The top and bottom thirds of each 
panel show the extent of Celera mate-pair violations
(red, misoriented; yellow, incorrect distance between
the mates) for each assembly grouped by library size.
(Mate pairs that are within the correct distance, as 
expected from the mean library insert size, are omit- 
ted from the figure for clarity.) Predicted breakpoints, 
corresponding to stacks of violated mate pairs of the 
same type, are shown as blue ticks on each assembly 
axis. Runs of more than 10,000 Ns are shown as cyan 
bars. Plots of all 24 chromosomes can be seen in Web 
fig. 3 on Science Online at www.sciencemag.org/cgi/ 
content/full/291/5507/1304/DC1.
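The predicted breakpoints in the figure correspond to stacks of violated mate pairs of the same type, and the text gives five or more simultaneously invalid pairs as the threshold. One way to find such stacks is a sweep over interval endpoints. The sketch below is an assumption about how "simultaneously" could be operationalized (overlapping spans on the assembly axis); the function name and the data layout are hypothetical.

```python
from collections import defaultdict


def predict_breakpoints(invalid_pairs, min_stack=5):
    """Find regions covered by at least min_stack invalid pairs of one kind.

    invalid_pairs: iterable of (start, end, kind), half-open spans, with kind
    being 'misoriented' or 'misseparated'. Returns sorted (start, end, kind).
    """
    by_kind = defaultdict(list)
    for start, end, kind in invalid_pairs:
        by_kind[kind].append((start, end))

    regions = []
    for kind, spans in by_kind.items():
        events = []
        for start, end in spans:
            events.append((start, 1))
            events.append((end, -1))
        events.sort()  # at equal positions, span ends sort before span starts
        depth = 0
        region_start = None
        for pos, delta in events:
            depth += delta
            if depth >= min_stack and region_start is None:
                region_start = pos
            elif depth < min_stack and region_start is not None:
                regions.append((region_start, pos, kind))
                region_start = None
    return sorted(regions)


# Toy data: five overlapping misoriented pairs and one isolated misseparated pair.
pairs = [(0, 100, "misoriented"), (10, 110, "misoriented"),
         (20, 120, "misoriented"), (30, 130, "misoriented"),
         (40, 140, "misoriented"), (500, 600, "misseparated")]
print(predict_breakpoints(pairs))
```

Keeping the two violation types separate mirrors the caption's "stacks of violated mate pairs of the same type".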
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> .bases flanking these regi6ns).-.The other bases 
in the region, those not covered by any homology evidence, were
replaced by N's. This sequence segment, with high-confidence regions
represented by the consensus genomic sequence and the remainder
represented by N's, was then evaluated by Genscan to see if a
consistent gene model could be generated. This procedure simplified
the gene-prediction task by first establishing the boundary for the
gene (not a strength of most gene-finding algorithms), and by
eliminating regions with no supporting evidence. If Genscan returned
a plausible gene model, it was further evaluated before being
promoted to an "Otto" annotation. The final Genscan predictions were
often quite different from the prediction that Genscan returned on
the same region of native genomic sequence. A weakness of using
Genscan to refine the gene model is the loss of valid, small exons
from the final annotation.
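The masking step described above, in which bases without homology evidence are replaced by N's before the gene finder is run, can be sketched as follows. The function name, the half-open 0-based coordinates, and the `flank` parameter are assumptions for illustration; the size of the flanking region actually kept is not legible in this copy.

```python
def mask_unsupported(sequence, evidence_spans, flank=0):
    """Replace bases outside all evidence spans (plus optional flank) with 'N'.

    evidence_spans: iterable of half-open, 0-based (start, end) intervals.
    flank: hypothetical number of flanking bases to keep on each side.
    """
    keep = [False] * len(sequence)
    for start, end in evidence_spans:
        lo = max(start - flank, 0)
        hi = min(end + flank, len(sequence))
        for i in range(lo, hi):
            keep[i] = True
    return "".join(base if kept else "N" for base, kept in zip(sequence, keep))


print(mask_unsupported("ACGTACGTAC", [(2, 5), (7, 9)]))
```

The masked string has the same length as the input, so coordinates of any gene model predicted on it map directly back to the scaffold.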
The next step in defining gene structures based on sequence
similarity was to compare each predicted transcript with the
homology-based evidence that was used in previous steps to evaluate
the depth of evidence for each exon in the prediction. Internal exons
were considered to be supported if they were covered by homology
evidence to within ±10 bases of their edges. For first and last
exons, the internal edge was required to be within 10 bases, but the
external edge was allowed greater latitude to allow for 5' and 3'
untranslated regions (UTRs). To be retained, a prediction for a
multi-exon gene must have evidence such that the total number of
"hits," as defined above, divided by the number of exons in the
prediction must be >0.66 or must correspond to a RefSeq sequence. A
single-exon gene must be covered by at least three supporting
[illegible] bases on each side), and these must cover the complete
predicted open reading frame. For a single-exon gene, we also
required that the Genscan prediction include both a start and a stop
codon. Gene models that did not meet these criteria were disregarded,
and those [illegible] predictions. Homology-based Otto predictions do
not contain 3' and 5' untranslated sequence. Although three de novo
gene-finding programs [GRAIL, Genscan, and FgenesH ([illegible])]
were run as part of the computational analysis, the results of these
programs were not directly used in making the Otto predictions. Otto
predicted 11,226 additional genes by means of sequence similarity.

3.2 Otto validation

To validate the Otto homology-based process and the method that Otto
uses to define the structures of known genes, we compared transcripts
predicted by Otto with their corresponding (and presumably correct)
transcript from a set of 4512 RefSeq transcripts for which there was
a unique SIM4 alignment (Table 7). In order to evaluate the relative
performance of Otto and Genscan, we made three comparisons. The first
involved a determination of the accuracy of gene models predicted by
Otto with only homology data other than the corresponding RefSeq
sequence (Otto homology in Table 7). We measured the sensitivity
(correctly predicted bases divided by the total length of the cDNA)
and specificity (correctly predicted bases divided by the sum of the
correctly and incorrectly predicted bases). Second, we examined the
sensitivity and specificity of the Otto predictions that were made
solely with the RefSeq sequence, which is the process that Otto uses
to annotate known genes (Otto-RefSeq). And third, we determined the
accuracy of the Genscan predictions corresponding to these RefSeq
sequences. As expected, the alignment method (Otto-RefSeq) was the
most accurate, and Otto-homology performed better than Genscan by
both criteria. Thus, 6.1% of the RefSeq nucleotides were not
represented in the Otto-RefSeq annotations and 2.7% of the
nucleotides in the Otto-RefSeq transcripts were not contained in the
original RefSeq transcripts. The discrepancies could come from
legitimate differences between the Celera assembly and the RefSeq
transcript due to polymorphisms, incomplete or incorrect data in the
Celera assembly, errors introduced by Sim4 during the alignment
process, or the presence of alternatively spliced forms in the data
set used for the comparisons.

Because Otto uses an evidence-based approach to reconstruct genes,
the absence of experimental evidence for intervening exons may
inadvertently result in a set of exons that cannot be spliced
together to give rise to a transcript. In such cases, Otto may "split
genes" when in fact all the evidence should be combined into a single
transcript. We also examined the tendency of these methods to
incorrectly split gene predictions. These trends are shown in Fig. 8.
Both RefSeq and homology-based predictions by Otto split known genes
into fewer segments than Genscan alone.

Table 7. Sensitivity and specificity of Otto and Genscan. Sensitivity
and specificity were calculated by first aligning the prediction to
the published RefSeq transcript, tallying the number (N) of uniquely
aligned RefSeq bases. Sensitivity is the ratio of N to the length of
the published RefSeq transcript. Specificity is the ratio of N to the
length of the prediction. All differences are significant (Tukey HSD;
P < 0.001).

Method                  Sensitivity   Specificity
Otto (RefSeq only)*     0.939         0.973
Otto (homology)†        0.604         0.884
Genscan                 0.501         0.633

*Refers to those annotations produced by Otto using only the
Sim4-polished RefSeq alignment rather than an evidence-based Genscan
prediction. †Refers to those annotations produced by supplying all
available evidence to Genscan.

[illegible] system is quite conservative, we used a different
[illegible] where the homology evidence was less strong. Here the
results of de [illegible] were [illegible] that a [illegible]
transcript have at least two of the following types of evidence to be
included in the gene set for further analysis: protein, [illegible]
EST, rodent EST, or mouse genome fragment matches. This final class
of predicted genes is a subset of the [illegible] made by the three
gene-finding programs that were used in the computational pipeline.
For these, there was not sufficient [illegible] attempt to predict a
gene structure. The three de novo gene-finding programs resulted in
about [illegible] predictions, of which [illegible] were nonredundant
(nonoverlapping with one another). Of these, 57,935 did not overlap
[illegible] predictions made by Otto. Only 21,350 of the [illegible]
that did not overlap Otto predictions were partially supported by at
least one type of sequence similarity evidence, and 8619 were
partially supported by two types of evidence (Table 8).

The sum of this number (21,350) and the number of Otto annotations
(17,764), 39,114, is near the upper limit for the human gene
complement. As seen [illegible] made more stringent, this number
drops rapidly so that demanding two types of evidence reduces the
total gene number to 26,383 and demanding three types reduces it to
~23,000. Requiring that a prediction be supported by all four
categories of evidence is too stringent because it would eliminate
genes that encode novel proteins (members of currently undescribed
protein families). No correction for pseudogenes has been made at
this point in the analysis.

In a further attempt to identify genes that were not found by the
autoannotation process or any of the de novo gene finders, we
examined regions outside of gene predictions that were similar to the
EST sequence, and where the EST matched the genomic sequence across a
splice junction. After correcting for potential 3' UTRs of predicted
genes, about 2500 such regions remained. Addition of a requirement
for at least one of the following evidence types (homology to mouse
genomic sequence fragments, rodent ESTs, or cDNAs) or similarity to a
known protein reduced this number to 1010. Adding this to the numbers
from the previous paragraph would give us estimates of about 40,000,
[illegible] 24,000 potential genes in the human genome, depending on
the stringency of evidence considered. Table 8 illustrates the number
of genes and presents the degree of
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; confidence based on the supporting evidence. 

Transcripts encoded by a set of 26,383 genes were assembled for further analysis. This set includes the 6538 genes predicted by Otto on the basis of matches to known genes, 11,226 transcripts predicted by Otto based on homology evidence, and 8619 from the subset of transcripts from de novo gene-prediction programs that have two types of supporting evidence. The 26,383 genes are [illegible] chromosome diagrams [illegible] preliminary [illegible] and are subject to all the limitations of an automated process. Considerable [illegible] still necessary [illegible] transcript predictions [illegible] descriptions of genes and the associated evidence that we present are the product of completely computational processes, not expert curation. We have attempted to enumerate the genes in the human genome in such a way that we have different levels of confidence based on the amount of supporting evidence: known genes, genes with good protein or EST homology evidence, and de novo gene predictions confirmed by modest homology evidence.

3.4 Features of human gene transcripts

We estimate the average span for a "typical" gene in the human DNA sequence to be about 27,894 bases. This is based on the average span covered by RefSeq transcripts, used because it represents our highest confidence set.

The set of transcripts promoted to gene annotations varies in a number of ways. As can be seen from Table 8 and Fig. 9, transcripts predicted by Otto tend to be longer, having on average about 7.8 exons, whereas those promoted from gene-prediction programs average about 3.7 exons. The largest number of exons that we have identified in a transcript is 234 in the titin mRNA. Table 8 compares the amounts of evidence that support the Otto and other predicted transcripts. For example, one can see that a typical Otto transcript has 6.99 of its 7.81 exons supported by protein homology evidence. As would be expected, the Otto transcripts generally have more support than do transcripts predicted by the de novo methods.



4 Genome Structure

Summary. [illegible] of the assembled genome sequence and their correlations [illegible] predicted gene set. These include an analysis of G+C content and [illegible] context of cytogenetic maps [illegible] enumerative analysis of CpG [illegible], a brief description of the genome-wide repetitive elements.



4.1 Cytogenetic maps

Perhaps the most obvious, and certainly the most visible, element [illegible] of the genome is the banding pattern produced by Giemsa stain. Chromosomal banding studies have revealed that about 17% to 20% of the human chromosome complement consists of C-bands, or constitutive heterochromatin [illegible] consists of different families of alpha [illegible] with various higher [illegible] structures (55). Many chromosomes have complex inter- and intrachromosomal duplications present in pericentromeric regions [citation illegible]. About 5% of the sequence reads were identified as alpha satellite sequences; these were not included in the assembly.




Fig. 8. [illegible] of split genes resulting from different annotation methods. A set of 4512 [illegible] genomic assembly were chosen (see the text). [Plot: x-axis, number of predictions per RefSeq transcript (0 to 17); series, Otto (homology), Otto (RefSeq only), and Genscan.]



                                           Types of evidence                    No. of lines of evidence*
                                 Total     Mouse     Rodent   Protein   Human     ≥1        ≥2        ≥3       ≥4

Otto     Number of transcripts   17,969    17,065    14,881   15,477    16,374    17,968†   17,501    15,877   12,451
         Number of exons         141,218   111,174   89,569   108,431   118,869   140,710   127,955   99,574   59,804
De novo  Number of transcripts   58,032    14,463    5,094    8,043     9,220     21,350    8,619     4,947    1,904
         Number of exons         319,935   48,594    19,344   26,264    40,104    79,148    31,130    17,508   6,520
No. of exons per transcript
         Otto                    7.84      5.77      6.01     6.99      7.24      7.81      7.19      6.00     4.28
         De novo                 [illegible] 3.17    3.80     3.27      4.36      [illegible] 3.56    3.42     3.16

*[illegible] considered to support gene predictions from the different methods. [illegible] cDNA, and similarity to known [illegible]. †[illegible] number includes alternative splice forms of the 17,764 [illegible].
Examination of pericentromeric regions is ongoing.

The remaining ~80% of the genome, the euchromatic component, is divisible into G-, R-, and T-bands (57). These cytogenetic bands have been presumed to differ in their nucleotide composition and gene density, although we have been unable to determine precise band [illegible] molecular level. T-bands are the most G+C- and gene-rich, and G-bands are G+C-poor [illegible] a description of the euchromatin at the molecular level as long stretches of DNA of [illegible] base composition, termed isochores (denoted L, H1, H2, and H3), which are >300 kbp in length (69). Bernardi defined the L (light) isochores as G+C-poor (<43%), [illegible] the H (heavy) isochores fall into three G+C-rich classes representing 24, 8, and 5% of the genome. Gene concentration has been claimed to be very low in the L isochores and 20-fold more enriched in

the H2 and H3 isochores (70). By examining contiguous 50-kbp windows of G+C content across the assembly, we found that regions of G+C content >48% (H3 isochores) averaged 273.9 kbp in length, those with G+C content between 43 and 48% (H1+H2 isochores) averaged 202.8 kbp in length, and the average span of regions with <43% (L isochores) was 1078.6 kbp. The correlation between G+C content and gene density was also examined in 50-kbp windows along the assembled sequence (Table 9 and Figs. 10 and 11). We found that the density of genes was greater in regions of high G+C than in regions of low G+C content, as expected. However, the correlation between G+C content and gene density was not as skewed as previously predicted (69). A higher proportion of genes were located in the G+C-poor regions than had been expected.
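The window-based isochore analysis can be sketched in a few lines. The example below is illustrative only: it applies the 43% and 48% G+C cutoffs quoted in the text to fixed, non-overlapping windows and merges adjacent windows of the same class into runs. How the original analysis treated boundary values, ambiguous bases, and partial windows is not stated, so those choices (and all function names) are assumptions.

```python
def gc_fraction(seq):
    """Fraction of G+C among unambiguous bases; None if the window has none."""
    seq = seq.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        return None
    return (seq.count("G") + seq.count("C")) / counted


def classify_window(gc):
    """Map a G+C fraction to an isochore class using the cutoffs in the text."""
    if gc > 0.48:
        return "H3"
    if gc >= 0.43:  # assumption: the 43% and 48% boundaries fall in H1+H2
        return "H1+H2"
    return "L"


def isochore_runs(seq, window=50_000):
    """Return (class, start, end) runs of adjacent windows sharing a class."""
    runs = []
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window]
        gc = gc_fraction(chunk)
        if gc is None:  # skip windows made only of gaps or ambiguous bases
            continue
        label = classify_window(gc)
        end = start + len(chunk)
        if runs and runs[-1][0] == label and runs[-1][2] == start:
            runs[-1] = (label, runs[-1][1], end)  # extend the current run
        else:
            runs.append((label, start, end))
    return runs
```

Averaging `end - start` over the runs of each class gives span lengths comparable in kind to those reported above, although the exact values depend on the assumptions listed.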

Chromosomes 17, 19, and 22, which have a disproportionate number of H3-containing bands, had the highest gene density (Table 10). Conversely, of the chromosomes that we






found to have the [illegible] 4, 18, 13, and Y, also have [illegible] H3 bands. Chromosome 15, which also has few H3 bands, did not have [illegible] density in our analysis. In addition, chromosome 8, [illegible]

[illegible] define a "desert" [illegible] kbp without a gene, then we see that 605 Mbp, or about 20% of the genome, [illegible] 22 have only [illegible] Mbp in deserts. [illegible] gene-poor chromosomes 4, 13, 18, and X have 27.5% of their 492 Mbp in deserts (Table 11). The apparent lack of predicted genes in these regions does not necessarily imply that they are devoid of biological function.

4.2 Linkage map

Linkage maps provide the basis for genetic analysis and are widely used in the study of the inheritance of traits and in the positional cloning of genes. The distance metric, centimorgans (cM), is based on the recombination [illegible] between homologous [illegible]. In general, the rate [illegible] in males, and this expansion is not uniform across the genome (72). One of the opportunities enabled by a nearly complete genome sequence is [illegible] that constitute the Genethon [illegible] to the genome. The rate [illegible] (73). From this [illegible] result, there is a difference [illegible] rates and highest rates and the largest difference [illegible] between males and females (4.99 [illegible] on chromosome 16). This indicates that the variability in recombination rates among regions of the genome exceeds the differences in recombination rates between males and females. The human genome has recombination hotspots, where recombination rates vary fivefold or more over a space of 1 kbp, so the picture one gets of the [illegible] magnitude [illegible]
:;ytweenhom"^^^^ 



Table 9. Characteristics of G+C in isochores.

Isochore   G+C (%)   Fraction of genome          Fraction of genes
                     Predicted*   Observed       Predicted*   Observed

H3         >48       5            9.5            37           24.8
H1/H2      43-48     25           21.2           32           26.6
L          <43       67           69.2           31           48.5

*The predictions were based on Bernardi's definitions (70) of the isochore structure of the human genome.



Fig. 9. Comparison of the number of exons per transcript between the 17,968 Otto transcripts and 21,350 de novo transcript predictions with at least one line of evidence that do not overlap with an Otto prediction. Both sets have the highest number of transcripts in the two-exon category, but the de novo gene predictions are skewed much more toward smaller transcripts. In the Otto set, 19.7% of the transcripts have one or two exons, and 5.7% have more than 20. In the de novo set, 49.3% of the transcripts have one or two exons, and 0.2% have more than 20. [Plot: x-axis, number of exons per transcript; series, number of Otto transcripts and number of de novo transcripts with at least one line of evidence.]

a higher threshold (method 2) on both data sets. In summary, genome-wide analysis [illegible] extended earlier analysis and suggests a strong correlation between CpG islands and first coding exons.



[illegible] genes have CpG islands at the [illegible]. CpG island methylation is correlated with gene inactivation (77) and has been shown to be important during gene imprinting (78) and tissue-specific gene expression (79).
Experimental methods have been used that resulted in an estimate of 30,000 to 45,000 CpG islands in the human genome (74, 80) and an estimate of 499 CpG islands on human chromosome 22 (81). Larsen et al. (76) and Gardiner-Garden and Frommer (75) used a computational method to identify CpG islands and defined them as regions of DNA of >200 bp that have a G+C content of >50% and a ratio of observed



itential-islaiid 
threshold.- 

To compute CpG statistics, we used two different thresholds of CG dinucleotide likelihood ratio. Besides using the original threshold of 0.6 (method 1), we used a higher threshold of CG dinucleotide likelihood ratio of 0.8 (method 2), which results in the number of CpG islands on chromosome 22 close to the number of annotated genes on this chromosome. [illegible]
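The island criteria can be made concrete with a short sketch. It assumes that the "CG dinucleotide likelihood ratio" corresponds to the observed/expected CpG ratio in the Gardiner-Garden and Frommer sense, (number of CpG × N) / (number of C × number of G) for a region of length N; the text does not spell out the formula, so this is an assumption, and the function names are hypothetical. The default thresholds follow the values quoted above (>200 bp, G+C >50%, ratio 0.6 for method 1).

```python
def cpg_stats(seq):
    """Return (G+C fraction, observed/expected CpG ratio) for a sequence."""
    seq = seq.upper()
    n = len(seq)
    c_count = seq.count("C")
    g_count = seq.count("G")
    cpg_count = seq.count("CG")  # "CG" cannot overlap itself, so count() is exact
    gc_content = (c_count + g_count) / n if n else 0.0
    if c_count == 0 or g_count == 0:
        return gc_content, 0.0
    obs_exp = (cpg_count * n) / (c_count * g_count)
    return gc_content, obs_exp


def is_cpg_island(seq, min_len=200, min_gc=0.5, min_ratio=0.6):
    """Apply the length, G+C, and ratio criteria to one candidate region."""
    if len(seq) <= min_len:
        return False
    gc_content, obs_exp = cpg_stats(seq)
    return gc_content > min_gc and obs_exp > min_ratio
```

Calling `is_cpg_island(region, min_ratio=0.8)` corresponds to the stricter method 2 threshold; scanning a chromosome would additionally need a sliding-window or extension step, which is omitted here.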



4.4 Genome-wide repetitive elements

The proportion of the genome covered by
yanous classes.ofrepetitiye DNA is present-^ 

^ this^hr^b^ ma;:^^^ 
4;mari2bd iri ^i6i^13^ f^fi^ 

with.rueth<J"|l<^SJ^^ 
. r^K ^^,^^.: W °^ ^«:S«l"ence -may ^,e:undeirepresente(^ • ^ 



CSA sequence as CpG, but 40% of the gene starts (start codons) are contained inside a



Fig. 10. Relation [illegible] percent of the genome [illegible] genes associated with each G+C bin is [illegible] 5% of the genome has a G+C content [illegible] 55%, but that this portion contains nearly 15% of the genes. [Plot: percent of genome and percent of genes by G+C content bin, 30-35% through 60-65%.]



the Celera assembly as a result of incomplete repeat resolution, as discussed above. About 8% of the scaffold length is in gaps, and we expect that much of this is repetitive sequence. Chromosome 19 has the highest repeat density (57%), as well as the highest gene density (Table 10). Of interest, among the different classes of repeat elements, we observe a clear association of Alu elements and gene density, which was not observed between LINEs and gene density.

5 Genome Evolution

Summary. The dynamic nature of genome evolution can be captured at several levels. These include gene duplications mediated by RNA intermediates (retrotransposition) and segmental genomic duplications. In this section, we document the genome-wide occurrence of retrotransposition events generating functional (intronless paralogs) or inactive genes (pseudogenes). Genes involved in translational processes and nuclear regulation account for nearly 50% of all intronless paralogs and processed pseudogenes detected in our survey. We have also cataloged the extent of segmental genomic duplication and provide evidence for 1077 duplicated blocks covering 3522 distinct genes.













Fig. 11 (continued). Relation among gene density (orange), G+C content (green), EST density (blue), and Alu density (pink) along the lengths of each of the chromosomes. Gene density was calculated in 1-Mbp windows. The percent of G+C nucleotides was calculated in 100-kbp windows. The number of ESTs and Alu elements is shown per 100-kbp window.



5.1 Retrotransposition in the human genome

Retrotransposition of processed mRNA transcripts into the genome results in functional genes, called intronless paralogs, or inactivated genes (pseudogenes). A paralog refers to a gene that appears in more than one copy in a given organism as a result of a duplication event. The existence of both intron-containing and intronless forms of genes encoding functionally similar or identical proteins has been previously described (84, 85). Cataloging these evolutionary events on the genomic landscape is of value in understanding the functional consequences of such gene-duplication events in cellular biology. Identification of conserved intronless paralogs in the mouse or other mammalian genomes should provide the basis for capturing the evolutionary chronology of these transposition events and provide insights into gene loss and accretion in the mammalian radiation.

A set of proteins corresponding to all 901 Otto-predicted, single-exon genes were subjected to BLAST analysis against the proteins encoded by the remaining multiexon predicted transcripts. Using homology criteria of 70% sequence identity over 90% of the length, we identified 298 instances of single- to multi-exon correspondence. Of these sequences, 97 were represented in the GenBank data set of experimentally validated full-length genes at the [illegible] and were verified by manual inspection.

We believe that these 97 cases may represent intronless paralogs (see Web table 1 on Science Online at www.sciencemag.org/cgi/content/full/291/5507/1304/DC1) of known genes. [illegible] repeat sequences, although the precise nature of these repeats remains to be determined. All of the cases for which we have high confidence contain polyadenylated [poly(A)] tails.

Recent publications describing the phenomenon of functional intronless paralogs speculate that retrotransposition may serve as a mechanism used to escape X-chromosomal inactivation (84, 86). We do not find a bias toward X chromosome origination of these retrotransposed genes; rather, the results show a random chromosome distribution of both the intron-containing and corresponding intronless paralogs. We also have found several cases of retrotransposition from a single source chromosome to multiple target chromosomes. Interesting examples include the retrotransposition of a five exon-containing ribosomal protein L21 gene on chromosome 13 onto chromosomes 1, 3, 4, 7, 10, and 14, respectively. The size of the source genes can also show variability. The largest example is the 31-exon diacylglycerol kinase zeta gene on chromosome 11 that has an intronless paralog on chromosome 13. Regardless of route, retrotransposition with subsequent gene changes in coding or noncoding regions that lead to different functions or expression patterns represents a key route to providing an enhanced functional repertoire in mammals (87).

Our preliminary set of retrotransposed intronless paralogs contains a clear overrepresentation of genes involved in translational processes (40% ribosomal proteins and 10% translation elongation factors) and nuclear regulation (HMG nonhistone proteins, 4%), as well as metabolic and regulatory enzymes. EST matches specific to a subset of intronless paralogs suggest expression of these intronless paralogs. Differences in the upstream regulatory sequences between the source genes and their intronless paralogs could account for differences in tissue-specific gene expression. Defining which, if any, of these processed genes are functionally expressed and translated will require further elucidation and experimental validation.

5.2 Pseudogenes

A pseudogene is a nonfunctional copy that is very similar to a normal gene but that has been altered slightly so that it is not expressed. We developed a method for the preliminary analysis of processed pseudogenes in the human genome as a starting point in elucidating the ongoing evolutionary forces

Table 11. Genome overview.

[label illegible]                                                        2.91 Gbp
[label illegible]                                                        1.99 Mbp
[label illegible]                                                        [illegible] Mbp
Percent of G+C in the genome                                             [illegible]
[label illegible]                                                        Chr. 2 (66%)
Percent of [illegible] as repeats                                        [illegible]
Number of annotated genes                                                26,383
Percent of annotated genes with unknown function                         42
Number of genes (hypothetical and annotated)                             39,114
Percent of hypothetical and annotated genes with unknown function        59
[label illegible]                                                        Titin (234 exons)
[label illegible]                                                        27 kbp
Most gene-rich chromosome                                                Chr. 19 (23 genes/Mb)
Least gene-rich chromosomes                                              Chr. 13 (5 genes/Mb), Chr. Y (5 genes/Mb)
Total size of gene deserts (>500 kb with no annotated genes)             605 Mbp
Percent of base pairs spanned by genes                                   25.5 to 37.8*
Percent of base pairs spanned by [illegible]                             1.1 to 1.4*
Percent of base pairs spanned by [illegible]                             24.4 to 36.4*
Percent of base pairs in intergenic DNA                                  74.5 to 63.6*
Chromosome with highest proportion of DNA in annotated exons             Chr. 19 (9.33)
Chromosome with lowest proportion of DNA in annotated exons              Chr. Y (0.36)
Longest intergenic region (between annotated + hypothetical genes)       Chr. 13 (3,038,416 bp)
Rate of SNP variation                                                    1/1250 bp

*In these ranges, the percentages correspond to the annotated gene set (26,383 genes) and the hypothetical + annotated gene set (39,114 genes), respectively.
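Table 11 counts gene deserts as regions of more than 500 kb with no annotated genes. A minimal sketch of that bookkeeping is shown below, assuming gene annotations are available as half-open (start, end) intervals on one chromosome; the function name and the treatment of chromosome ends as potential deserts are assumptions, not details taken from the analysis.

```python
def gene_deserts(genes, chrom_length, min_size=500_000):
    """Return gene-free intervals longer than min_size on one chromosome.

    genes: iterable of half-open (start, end) intervals; overlaps are allowed.
    """
    deserts = []
    cursor = 0  # right edge of the annotated region seen so far
    for start, end in sorted(genes):
        if start - cursor > min_size:
            deserts.append((cursor, start))
        cursor = max(cursor, end)
    if chrom_length - cursor > min_size:
        deserts.append((cursor, chrom_length))
    return deserts
```

Summing `end - start` over the returned intervals for every chromosome gives a genome-wide desert total of the kind reported in the table.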

Table 12. Rate of recombination per physical distance (cM/Mb) across the genome. Genethon markers were placed on CSA-mapped assemblies, and then relative physical distances and rates were calculated in 3-Mb windows for each chromosome. NA, not applicable.

Chrom.    Male                   Sex-average            Female
          Max.   Avg.   Min.     Max.   Avg.   Min.     Max.   Avg.   Min.

1         2.60   1.12   0.23     2.81   1.42   0.52     [illegible] 1.76 0.68
2         2.23   0.78   0.33     2.65   1.12   0.54     3.17   1.40   0.61
3         2.55   0.86   0.23     2.40   1.07   0.42     2.71   1.30   0.33
4         1.66   0.67   0.15     2.06   1.04   0.60     2.50   1.40   0.77
5         2.00   0.67   0.18     1.87   1.08   0.42     2.26   1.43   0.62
6         1.97   0.71   0.28     2.57   1.12   0.37     3.47   1.67   0.64
7         2.34   1.16   0.48     1.67   1.17   0.47     2.27   1.21   0.34
8         1.83   0.73   0.14     2.40   1.05   0.46     3.44   1.36   0.43
9         2.01   0.99   [illegible] 1.95 1.32  0.77     2.63   1.66   0.82
10        [illegible] 1.03 [illegible] 3.05 1.29 0.66   [illegible] 1.51 0.76
11        [illegible; only the final value, 0.49, is legible]
12        4.12   0.76   0.26     3.35   1.16   0.49     2.93   1.55   0.59
13        1.60   0.75   0.01     1.87   0.95   0.17     2.49   1.19   0.32
14        3.15   0.98   0.18     2.65   1.30   0.62     3.14   1.63   0.75
15        2.28   0.94   0.34     2.31   1.22   0.42     2.53   1.56   0.54
16        1.83   1.00   0.47     2.70   1.55   0.63     4.99   2.32   1.12
17        3.87   0.87   0.00     3.54   1.35   0.54     4.19   1.83   0.94
18        3.12   1.37   0.86     3.75   1.66   0.43     4.35   2.24   0.72
19        3.02   0.97   0.10     2.57   1.41   0.49     2.89   1.75   0.87
20        3.64   0.89   0.00     2.79   1.50   0.83     3.31   2.15   1.34
21        3.23   1.26   0.69     2.37   1.62   1.08     2.58   1.90   1.18
22        1.25   1.10   0.84     1.88   1.41   1.08     3.73   2.08   0.93
X         NA     NA     NA       NA     NA     NA       3.12   1.64   0.72
Y         NA     NA     NA       NA     NA     NA       NA     NA     NA
Genome    4.12   0.88   0.00     3.75   1.22   0.17     4.99   1.55   0.32
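The cM/Mb rates summarized in Table 12 can be illustrated with a small sketch. It assumes each marker carries a physical position (Mb) and a genetic position (cM), and estimates the rate in each 3-Mb window from the first and last marker falling inside it. The exact windowing and interpolation used for the table are not described in the caption, so this procedure and the function name are assumptions.

```python
def window_rates(markers, window_mb=3.0):
    """Estimate recombination rate (cM/Mb) per fixed physical window.

    markers: iterable of (physical_mb, genetic_cm) pairs for one chromosome.
    Returns a list of (window_start_mb, rate) for windows with >= 2 markers.
    """
    windows = {}
    for pos, cm in sorted(markers):
        index = int(pos // window_mb)
        windows.setdefault(index, []).append((pos, cm))

    rates = []
    for index in sorted(windows):
        points = windows[index]
        if len(points) < 2:
            continue  # a single marker gives no distance to measure
        physical_span = points[-1][0] - points[0][0]
        if physical_span <= 0:
            continue
        genetic_span = points[-1][1] - points[0][1]
        rates.append((index * window_mb, genetic_span / physical_span))
    return rates
```

Taking the maximum, mean, and minimum of the per-window rates for a chromosome yields the three summary columns of the kind shown for each map.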























www.sciencemag.org SCIENCE VOL 291 16 FEBRUARY 2001

that account for gene inactivation. The general structural characteristics of these processed pseudogenes include the complete lack of intervening sequences found in the functional counterparts, a poly(A) tract at the 3' end, and direct repeats flanking the pseudogene sequence. Processed pseudogenes occur as a result of retrotransposition, [illegible] predicted [illegible] the genomic sequence by means of BLAST [illegible] regions corresponding to all Otto-predicted transcripts were excluded from this analysis. We identified 2909 regions matching with greater than 70% identity over at least 70% of the length of the transcripts that [illegible] represent processed pseudogenes. This number is probably an underestimate because specific methods to search for pseudogenes were not used.

We looked for correlations between structural elements and the propensity for retrotransposition in the human genome. GC content and transcript length were compared between the genes with processed pseudogenes (1177 source genes versus [illegible] pseudogenes have shorter average transcript [illegible]

5.3 Gene duplication in the human genome

Building on a previously published procedure (27), we developed a graph-theoretic algorithm, [illegible] for [illegible] human protein set into protein families (89).

Table 13. Characteristics of CpG islands identified in chromosome 22 (34-Mbp sequence length) and the whole genome (2.9-Gbp sequence length) by means of two different methods. Method 1 uses a CG likelihood ratio of ≥0.6. Method 2 uses a CG likelihood ratio of ≥0.8.

                                                          Chromosome 22          Whole genome (CS assembly)
                                                          Method 1   Method 2    Method 1   Method 2

Number of CpG islands detected                            5,211      522         195,706    26,876
Average length of island (bp)                             390        535         395        497
Percent of sequence predicted as CpG                      5.9        0.8         2.6        0.4
Percent of first exons that overlap a CpG island          44         25          42         22
Percent of first exons with first position of exon
  contained inside a CpG island                           37         22          40         21
Average distance between first exon and closest
  CpG island (bp)                                         1,013      10,486      2,182      17,021
Expected distance between first exon and closest
  CpG island (bp)                                         3,262      32,567      7,164      55,811

Table 14. Distribution of repetitive DNA in the compartmentalized shotgun assembly sequence.

Repetitive elements                            Megabases in          Percent of    Previously
                                               assembled sequences   assembly      predicted (%) (83)

Alu                                            [illegible]           [illegible]   10.0
Mammalian interspersed repeat (MIR)            [illegible]           [illegible]   1.7
Medium reiteration (MER)                       50                    1.7           1.6
Long terminal repeat (LTR)                     155                   5.3           5.6
Long interspersed nucleotide element (LINE)    466                   16.1          16.7
Total                                          1025                  35.3          35.6


[illegible] The complete [illegible] that result from the Lek clustering provide [illegible] for comparing the role of whole-genome or chromosomal duplication in protein family expansion as opposed to other means, such as tandem duplication. [illegible] represents a closed and certain island [illegible] simultaneously clustering protein [illegible] of several organisms, [illegible] contributed by each organism [illegible] complete cluster can be predicted with confidence depending on the quality of [illegible] each genome. The variance of each organism's contribution to each cluster can then be calculated, allowing an assessment of the relative [illegible] versus smaller-scale [illegible] expansion and contraction of [illegible] presumably as a result of natural selection operating on individual protein families within an organism. As can be seen in Fig. 12, the large variance in the relative numbers of human as compared with D. melanogaster and Caenorhabditis elegans proteins in complete clusters may be explained by multiple events of relative expansions in gene families in each of the three animal genomes. Such expansions would give rise to the distribution that shows a peak at 1:1 in the [illegible] human-worm or human-fly clusters [illegible] slope spread covering both human and fly/worm predominance, as we observed (Fig. 12). Furthermore, there are nearly as many clusters where worm and fly proteins predominate despite the larger numbers of proteins in the human. At face value, this analysis suggests that natural selection [illegible] on individual protein families has been a major force driving the expansion of at least some elements of the human protein set. However, in our analysis, the difference between an ancient whole-genome duplication followed by loss, versus piecemeal duplication, cannot be easily distinguished. In order to differentiate these scenarios, more extended analyses were performed.

5.4 Large-scale duplications

Using two independent methods, we searched for large-scale duplications in the human genome. First, we describe a protein family-based method that identified highly conserved blocks of duplication. We then describe our comprehensive method for identifying all interchromosomal block duplications. The latter method identified a large number of duplicated chromosomal segments covering parts of all 24 chromosomes.

The first of the methods is based on the idea of searching for blocks of highly conserved homologous proteins that occur in more than one location on the genome. For this comparison, two genes were considered equivalent if their protein products were determined to be in the same family and the same complete Lek cluster (essentially paralogous genes) (89). Initially, each chromosome was represented as a string of genes






filtering methods, a shuffled protein set was first created by taking the 26,588 proteins, randomizing their order, and then partitioning them into 24 shuffled chromosomes, [illegible]

[illegible] at several evolutionary stages (94). The figure also illustrates that some chromosomes, such as chromosome 2, contain many [illegible] a block duplication that is nearly as [illegible]

ordered by the start codons for predicted [illegible] blocks of duplication were [illegible]. The detection of only a relatively [illegible] of using an intrinsically conservative [illegible] three genes, 137 contain four genes, and [illegible] contain five or more genes.

To illustrate the extent of the detected duplications, Fig. 13 shows all 1077 block duplications indexed to each chromosome in 24 panels in which only duplications mapped to the indexed chromosome are displayed. The figure makes it clear that the duplications are ubiquitous in the genome. One feature



method grounded in the conservative constraints of the complete Lek clusters.

In the second, more comprehensive approach, we aligned all chromosomes directly with one another using an algorithm based on the MUMmer system (91). This alignment method uses a suffix tree data structure and a linear-time algorithm to align long sequences very rapidly; for example, two chromosomes



the pair of chromosome arms. This breadth of
duplication is also seen on the two chromosomes
carrying the other two Hox clusters.

An additional large duplication, between
chromosomes 18 and 20, serves as a good
example to illustrate some of the features
common to many of the other observed large
. J duplications (Fig. 13^ 
c6ntain^":;64 /detected.: w . 



nf inn 'KAUr^ ^<.r. 'u^ v J. J • r xt. . -7.--™^"**-"*' «x;,>xi^ Bwuyiiic. v^uc icaiiurc , conramS vO^vaetectea : ordered mtrachromo 




mosomal stretches, with one-to-many duplication
relationships that are graphically striking.
One such example captured by the analysis
is the well-documented olfactory receptor
(OR) family, which is scattered in blocks
throughout the genome and which has been
analyzed for genome-deployment reconstruc-



.mm (on a Compaq Alpha computer) with 4 
gigabytes of memory. This procedure was
used recently to identify numerous large-scale
segmental duplications among the five
chromosomes of A. thaliana (92); in that
organism, the method revealed that 60% of
the genome (66 Mbp) is covered by 24 very
large duplicated segments. In A. thaliana,
DNA-based alignment was sufficient to reveal
the segmental duplications between
chromosomes; in the human genome, DNA
alignments at the whole-chromosome level
are insufficiently sensitive. Therefore, a modified
procedure was developed and applied,
as follows. First, all 26,588 proteins
(9,675,713 amino acids) were concatenated
end-to-end in order as they occur
along each of the 24 chromosomes, irrespective
of strand location. The concatenated protein
set was then aligned against each chromosome
by the MUMmer algorithm. The
resulting matches were clustered to extract all
sets of three or more protein matches that
occur in close proximity on two different
chromosomes (91); these represent the candidate
segmental duplications. A series of
filters were developed and applied to remove
likely false-positives from this set; for example,
small blocks that were spread across
many proteins were removed. To refine the



V :coimting>'40-lv^^^^ of chromosome 1 8 
free of matches to chromosome 20, which is
likely to represent a large insert (between the
gene assignments "Krup rel" and "collagen
rel" on chromosome 18 in Fig. 13), the full
duplication segment covers 36 Mb on chromosome
18 and 28 Mb on chromosome 20.




[Fig. 12 plot: series Human/Worm and Human/Fly; horizontal axis, Ratio (ticks include 1:3, 1:4, 1:6 on the fly/worm-predominant side).]



Fig. 12. Gene duplication in complete protein clusters. The predicted protein sets of human, worm,
and fly were subjected to Lek clustering (27). The numbers of clusters with varying ratios (whole
number) of human versus worm and human versus fly proteins per cluster were plotted




By :tkis .:ineasure;: the Vduplication; segment;;^^^^^^^^^^^ of our' genome, and - 
\ spans nearly half of each chromosome's iiet,f>V=obseryed^^ many.cbmpareii regions. rHypothe^[v:L^^th'it a^ of ^ 

•length; -The most. likely = scenario is that the ^^ffsesit^^^^^ 

-whole span of this region was"duplicated as. a. \< .processes mustbe tested' .^;;vjv....;;^^ ' . 



/:^^>.some perspectiye on .dating of the] duplicationstt^J^y^^ 
l;-^{As;noted.al^^ 



:j;6 5A ;genom^^^^ of 

Variations v;;^=H:V - v^^^^^V:- 




single veiy large block, followed by shuffling v; ;Eyaluation : of ,tbe , alignmait results i:gives i-^ 
owing to smaller. scale feairangemente^^^^^^ j------ .^xi.. 

',!;f:,4such, at least fom-^subs^^^ 
i' *.,':>.w6uld;:need^ 

•v'V.v'lrelative insertions and ^ . . . . 

-'■v;!^ ! duplicated segment .iriteiyal/jnie'j^ 

^ v^:';^ -pairs in this'alignmentocoir among 2 1 7;'pro-?v^^;i in jtheyaige-sc^e^^i^^ ii^on to ^bthe^^JSNP ifesoiirc -SNI^ : rate;' b^'^f 

v: tein/ assignments .on'Tchronipkome'^^^^^ .wa^ —'l p^r;i2bo lo!^^ 

;r^:,^\r.am6ng^322^prote nonrandqmly"!; 

1: some 20, for a density of involved proteins .pf 3:^'; mosbn^ ; regioni;.Tlie^ , (X)rresponto -r. mouse ^j/tooug^ Oiily a very,. ;sn)nll; |; 

';■ ■ V 1 .26 to ;30%. Jhis. is consistent with an. ancient v;>,>; chromo^^^^ regions .aire |nuch.'morei .similar in n; ccproportion |of ;tall'j;.SNPs (<1 %) . ■potQiitially 
{ 'VvvX-vlargersdale; \d by "subse- S\ jv'^seque^ J^^^^. 9^. Ihk ifunci 'if : 

\ quenf gene loss ^ ' ' -^1^1^. ... . 1 1 .-r .ovm_.*t.-*--.rr.-* 

; : Loss .of just one member '; pf a gene pair 
- ■ -•V subsequent to the dupli^^^ „ .v - . ...... ^^ . .. . - - ; 

' . a failure to score a gene pair in the block; less v.. each. bear. a.sigmfic£mt proporfpn of geiies or-;^;f ; genetic^ariations may contribute to.the struc- ( 
* ' ■ than 50% ''gene - loss on the chromosomes IVVthologous ; to the* human genes ^on .which' the tural. diversity of hima^^ * . : 

' ^ ^would; lead to. the. duplication^density/'ob-'U^f human dupHcation assignments were made. On ; ^ - /v/ • '"::'>V'- j-"- . ■ 7 
■. .- served here. As\an . independent yenfication :.^^ 

of the si^ificance of the alignments detect-,U mouse. ichromosomal; spans, ^at.w 

. ed, it can be seen that a substantial number of ,.^tion, appear, to be pn>ducts,of.thasame lar^ -in the rate of gene discbyety, but only ihmii^^ ;^ 
..; the pairs of aligning proteins in this duplica- ::.uscale Jdup^^ in ^^A can wc,. 

. . y tion; including some of those annotated (^^^ 



4 




discover the genetic basis for variation in health 
out once a more complete genome is assembled - . : among himian beings. Whole-genome shotgun . . 
ifor niouse, the .undCTlying jarge.r.d^^ ./v sequencing .is a.particulily effectiye n^^^ 

/^.appear .to. preckte. the two; ^iecies Vdiyer^^ 
v3his <tos.the.dupy9^^ 

(.(ilyergenc^ of theprinmte arid rpdem spared 'the: distribiitiph^ • attributes; of ^Nj^STj: 

.CJhis date caii be fiirther refined upon exaniina-.'^-..V'^ascertain'ed;by ', 

'tion of ,the synteny. between human chromo-. ' iment of the Cele'ra consensus sequence to ih'c 
somes and those of chicken, pufferfish {Fugu / PFP assembly, (ii) overlap of high-quality reads 
rubripes), or.zebrafish (Pi). The only sub- : of genomic sequence (referred to as "Kwok"; 
. stantial syntenic stretches mapped in these 1,120,195 SNPs) (P7), and (iii) reduced rcprc- 
_ species corresponding to both pairs of human ■ sentation shotgun ;; sequencing- (refeBced-:lo as 

duplicated scginents^(see web table 2 on . duplications are restricted tp the Hox cluster : ; - "TSC!; 632,(540 SNPs) .(P^ These data were 
ence Online at . www.sciencemag.org/cgi/con-.. ;:^ ^egipns;^men .the, nucleotide di- 

tent/fuil/291/5507/1304/DCl). -We have also : . (or others) to human chromosomes is extend- : versity of -8 Ik 10 ^ marked heterogeneity 
observed a few instances where paralogs on ed with further mapping, the ^ ages of the across the genome in SNP density, ' ande an 



13), are those populating small Lek complete
clusters (see above). This indicates that they
are members of very small families of paralogs;
their relative scarcity within the genome
validates the uniqueness and robust nature of
their alignments.
Two additional qualitative features were observed
among many of the large-scale duplications.
First, several proteins with disease associations,
with OMIM (Online Mendelian Inheritance
in Man) assignments, are members of



both duplicated segments are associated with
similar disease conditions. Notable among
these genes are proteins involved in hemostasis
(coagulation factors) that are associated with
bleeding disorders, transcriptional regulators
like the homeobox proteins associated with
developmental disorders, and potassium channels
associated with cardiovascular conduction
abnormalities. For each of these disease genes,
closer study of the paralogous genes in the
duplicated segment may reveal new insights
into disease causation, with further investigation
needed to determine whether they might be
involved in the same or similar genetic diseases.
Second, although there is a conserved number
of proteins and coding exons predicted for specific
large duplicated spans within the chromosome
18 to 20 alignment, the genomic DNA of
chromosome 18 in these specific spans is in
some cases more than 10-fold longer than the
corresponding chromosome 20 DNA. This selective
accretion of noncoding DNA (or conversely,
loss of noncoding DNA) on one of a



nearly chromosome-length duplications seen
in humans are likely to be dated to the root of
vertebrate divergence.

The MUMmer-based results demonstrate
large block duplications that range in size from
a few genes to segments covering most of a
chromosome. The extent of segmental duplications
raises the question of whether an ancient
whole-genome duplication event is the underlying
explanation for the numerous duplicated
regions (96). The duplications have undergone
many deletions and subsequent rearrangements;
these events make it difficult to distinguish
between a whole-genome duplication and multiple
smaller events. Further analysis, focused
especially on comparing the estimated ages of
all the block duplications, derived partially
from interspecies genome comparisons, will be
necessary to determine which of these two
hypotheses is more likely. Comparisons of genomes
of different vertebrates, and even cross-phyla
genome comparisons, will allow for the
deconvolution of duplications to eventually re-



overwhelming preponderance of noncoding
variation that produces no change in expressed
proteins.

6.1 SNPs found by aligning the Celera
consensus to the PFP assembly

Ideally, methods of SNP discovery make full
use of sequence depth and quality at every site,
and quantitatively control the rate of false-positive
and false-negative calls with an explicit
sampling model (99). Comparison of consensus
sequences in the absence of these details necessitated
a more ad hoc approach (quality scores
could not readily be obtained for the PFP assembly).
First, all sequence differences between
the two consensus sequences were identified;
these were then filtered to reduce the contribution
of sequencing errors and misassembly. As
a measure of the effectiveness of the filtering
step, we monitored the ratio of transition and
transversion substitutions, because a 2:1 ratio
has been well documented as typical in mammalian
evolution (100) and in human SNPs
(101, 102). The filtering steps consisted of
removing variants where the quality score in the
Celera consensus was less than 30 and where
the density of variants was greater than 5 in 400
bp. These filters resulted in shifting the
transition-to-transversion ratio from 1.57:1 to
>/im:im^ ^Ued to 2.3^3bp,ofali^ents^ 
,,:tetvyeen,the;:CelerH>.andp^^^ 

..be^en^}set_^.S^ 
^PgerrMthc^a^

6.2 Comparisons to public SNP
databases

•'i^f^'^SNP^Jnclu^^ 
:;foS?^;^^yw.ncW 

13.150 from,HGMD.(Human Gene Muta-- 
; t'on.;D^ase;v;fix)mjrthe mive^ 
'^^'^>yms.xaappcd on the Celera con- 
^ sensus f sequence «by^,a ^^sequenc^^-siiiiuiarity ^ 
search with the program PowerBlast (103). The
two largest data sets in dbSNP are the Kwok
and TSC sets, with 47% and 25% of the dbSNP
records. Low-quality alignments with partial
coverage of the dbSNP sequence and alignments
that had less than 98% sequence identity
between the Celera sequence and the dbSNP
flanking sequence were eliminated. dbSNP
sequences mapping to multiple locations on the

^i^Jr'TX^^'^ '^'^^ A total of 
were mapped to 
.1,223,038 unique locations on.the .Gelera se- 

T^^m -^^^^ considerable redundancy in 
SNPs in the TSC set mapped to
585,811 unique genomic locations, and SNPs in
the Kwok set mapped to 438,032 unique locations.
The combined unique SNPs counts used
in this analysis, including Celera-PFP, TSC,
and Kwok, is 2,737,668. Table 15 shows that a
substantial fraction of SNPs identified by one of
these methods was also found by another method.
The very high overlap (36.2%) between the
Kwok and Celera-PFP SNPs may be due in part
to the use by Kwok of sequences that went

"""^y overlap 
(16.4%) between the Kwok and TSC sets is due
to their being the smallest two sets. In addition,
24^% of the Celera-PFP SNPs overlap with
SNPs derived from the Celera genome se-
'■;^?I"?°f^-:(f«:^SNp.^validation^;in .population; 

laMous process ' 

(,^;S9ajnfirmatiohon 

:? wde an efficient initial vaUdatidn '^iTsilk^Vbv^ 
■ computational analysis). ? , : •> ' ;. : . ' ■ 
,_>OneAmeMsTof.as:^ 

*P.nian;Mriatioh is to 



Table 15. Overlap of SNPs from genome-wide
SNP databases. The numbers in parentheses are
the fraction of overlap, calculated as the count of
overlapping SNPs divided by the number of SNPs
in the smaller of the two databases compared.
Total SNP counts for the databases are: Celera-PFP,
2,104,820; TSC, 585,811; and Kwok, 438,032.
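
The fraction-of-overlap rule in the Table 15 caption can be checked with a few lines of Python. This is only an illustrative sketch: the function name and the dictionary layout are assumptions made here, while the totals come from the caption and the two overlap counts are the Kwok entries reported in Table 15.

```python
# SNP totals per data set, as stated in the Table 15 caption.
TOTALS = {"Celera-PFP": 2_104_820, "TSC": 585_811, "Kwok": 438_032}


def overlap_fraction(overlap_count, set_a, set_b, totals=TOTALS):
    """Overlap count divided by the size of the smaller of the two data sets."""
    return overlap_count / min(totals[set_a], totals[set_b])


# Overlap counts for the Kwok set, as reported in Table 15.
kwok_celera = overlap_fraction(158_532, "Kwok", "Celera-PFP")
kwok_tsc = overlap_fraction(72_024, "Kwok", "TSC")
print(f"Kwok vs Celera-PFP: {kwok_celera:.3f}")
print(f"Kwok vs TSC:        {kwok_tsc:.3f}")
```

Rounded to three places, the two values agree with the 36.2% and 16.4% overlaps quoted in the text.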



^csure?^of^iiucleotid^ 

:::6;««^4^?*??^?t#^lidates::ti^ 

Jservatipns^at ^;^he ] wh6!e-gai(iiiie;;scale ■ 
vjJe^Kre?narka^ 

v^^^f?*^^^W^he-Kw^ 
'Sf-^'iffjn^iJS^le-genohi^^ 
this substitution pattern. Compared with
the rest of the data sets, Celera-PFP deviates
slightly from the 2:1 transition-to-transversion
ratio observed in the other data
sets. This result is not unexpected
because some fraction of the computationally
identified SNPs in the Celera-PFP
comparison may in fact be sequence errors.
A 2:1 transition:transversion ratio for the
bona fide SNPs would be obtained if one
assumed that 15% of the sequence differences
in the Celera-PFP set were a result of
(presumably random) sequence errors.
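
The 15% figure can be reproduced approximately with a simple mixture calculation. The sketch below is not the authors' computation; it assumes that a random sequence error is a transition one time in three (each base has one transition and two transversion alternatives), and the function name and parameters are hypothetical. The observed ratio of 1.59:1 is the Celera-PFP value from Table 16.

```python
def error_fraction(observed_ratio, true_ratio=2.0, error_ratio=0.5):
    """Fraction of differences that must be errors to dilute true_ratio to observed_ratio.

    All ratios are transition:transversion. error_ratio=0.5 encodes the
    assumption of this sketch that random errors give one transition for
    every two transversions.
    """
    def transition_share(ratio):
        # Convert a Ts:Tv ratio into the share of substitutions that are transitions.
        return ratio / (1.0 + ratio)

    gap = transition_share(true_ratio) - transition_share(observed_ratio)
    span = transition_share(true_ratio) - transition_share(error_ratio)
    return gap / span


celera_pfp_errors = error_fraction(1.59)
print(f"implied error fraction: {celera_pfp_errors:.2f}")
```

Under these assumptions the implied error fraction comes out near 0.16, in line with the roughly 15% quoted above.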

6.3 Estimation of nucleotide diversity
from ascertained SNPs

The number of SNPs identified varied
widely across chromosomes. In order to
normalize these values to the chromosome
size and sequence coverage, we used π, the
standard statistic for nucleotide diversity
(AO^). Nucleotide diversity is a measure of
per-site heterozygosity, quantifying the
probability that a pair of chromosomes
drawn from the population will differ at a
nucleotide site. In order to calculate nucleotide
diversity for each chromosome, we
need to know the number of nucleotide
sites that were surveyed for variation, and
in methods like reduced representation sequencing,
we need to know the sequence
quality and the depth of coverage at each
site. These data are not readily available, so
we could not estimate nucleotide diversity

J.?tide^,yersUyifromijilgh,q sequence 

?^'^^;'f^ation;-i^^ebdeai& the deteilj 
■'^i'-or :aU-the. alignifaMts. ^'fevt;?^^ 

Ji?3:^,:^timation of^ucleptide-diver^^ 
,,^;^hotgun 3sseml^lj^e^ fo, ^3,^ 

#«^r:m(Mei^ 

SS?and,|he; prot^bili^c^ 

f^mk ^e^allelesjiayeidifeeirt-s^ ^ ■' 

^;f;?me,prpbabiH 

^f^m^iH^^^:mpi^^.^ ihe^higher 
..|v^^seguerice;^uality,<;^e:high^r^^ : 
j^fiiof successfully detecting a SNP (70i)..Even 
^j;ijafterwrrEctingfbr,variatioii^i^ % 

..:nucleotide divers.ity^appeared jto, vary across ■ 
^5iffl?9|omes.;Tlie:sigiim^ ; ■ 

^^;'i?'?y:;yas tested byi^lysis:^f viriMce: ,4h i 
.;i';estmiates of *>fpr,10p-kbp^ivriiid^ esti- ' 
3.,.;mate.yariabiHfy,wh^ . . 

i; .Celera-PFP,xomparis6n. > = 291V p 

;,r;p.opoi). v^;o.;>>^^^ • 

,!:;J#%age,d^ es- : - 

, M was^8.94 .X .10-; Nucleotide diveiify on 
/; . me A ;Chrpmpspme .was (5;54: X 

is expected to be less variable than autosomes,
because for every four copies of
autosomes in the population, there are only
three X chromosomes; this smaller effective
population size means that random
drift will more rapidly remove variation
from the X (105).

. ■ vHaving. ascertam^^^ 
genome-wide, it appears that previous estimates
of nucleotide diversity in humans
based on samples of genes were reasonably
accurate (101, 102, 105, 107). Genome-wide,
our estimate of nucleotide diversity was

:Viand,a;published:,estimate,:SV^^ 1q 



6.4 Variation in nucleotide diversity
across the human genome

Such an apparently high degree of variability
among chromosomes in SNP density
raises the question of whether there is heterogeneity
at a finer scale within chromo-

Table 15 values (overlap count, with the fraction of overlap in parentheses where printed).

              Celera-PFP         TSC
TSC           188,694
Kwok          158,532 (0.362)    72,024 (0.164)

Table 16. Summary of nucleotide changes in different SNP data sets.

SNP data set   A/G (%)   C/T (%)   A/C (%)   A/T (%)   C/G (%)   T/G (%)   Transition:transversion
Celera-PFP     30.7      30.7      10.3      8.6       9.2       10.3      1.59:1
Kwok*          33.7      33.8      8.5       7.0       8.6       8.4       2.07:1
TSC†           33.3      33.4      8.8       7.3       8.6       8.6       1.99:1
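
The transition:transversion column of Table 16 follows from the six percentage columns. The sketch below recomputes it; the dictionary layout and names are assumptions made for illustration, and the column order (A/G, C/T, A/C, A/T, C/G, T/G) is taken as printed, with A/G and C/T counted as transitions.

```python
# Percentages of each substitution class per data set, as read from Table 16.
TABLE_16 = {
    "Celera-PFP": {"A/G": 30.7, "C/T": 30.7, "A/C": 10.3, "A/T": 8.6, "C/G": 9.2, "T/G": 10.3},
    "Kwok": {"A/G": 33.7, "C/T": 33.8, "A/C": 8.5, "A/T": 7.0, "C/G": 8.6, "T/G": 8.4},
    "TSC": {"A/G": 33.3, "C/T": 33.4, "A/C": 8.8, "A/T": 7.3, "C/G": 8.6, "T/G": 8.6},
}

# Purine-purine and pyrimidine-pyrimidine changes are transitions.
TRANSITIONS = {"A/G", "C/T"}


def ts_tv_ratio(percentages):
    """Sum of transition percentages divided by sum of transversion percentages."""
    ts = sum(v for k, v in percentages.items() if k in TRANSITIONS)
    tv = sum(v for k, v in percentages.items() if k not in TRANSITIONS)
    return ts / tv


ratios = {name: ts_tv_ratio(row) for name, row in TABLE_16.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}:1")
```

Because the table percentages are rounded to one decimal place, the recomputed ratios match the printed column only to within about 0.02.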



2000 release of NCBI dbSNP (www.ncbl.nlm.nlh tLmlSHPlZuh^i, Washington University." , tNoveiiiber 

TSC-WUCSC Thesubmltter oV the data U ^"KK C^S^t^artf ^^^^^ 
Fig. 13. Segmental duplications between chromosomes
in the human genome. The 24 panels show
the 1077 duplicated blocks of genes, containing
10,310 pairs of genes in total. Each line represents
a pair of homologous genes belonging to a block;
all blocks contain at least three genes on each of
the chromosomes where they appear. Each panel
shows all the duplications between a single
chromosome and other chromosomes with shared
blocks. The chromosome at the center of each
panel is shown as a thick red line for emphasis.
Other chromosomes are displayed from top to
bottom within each panel ordered by chromosome
number. The inset (bottom, center right) shows a
close-up of one duplication between chromosomes
18 and 20, expanded to display the gene
names of 12 of the 64 gene pairs shown.
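
The block criterion in this caption (at least three genes on each chromosome) and the requirement that matches lie in close proximity can be sketched as a simple chaining step. This is a hedged illustration, not the clustering procedure the authors cite: the function names, the tuple layout, and the `max_gap` threshold are all hypothetical.

```python
from collections import defaultdict


def _big_enough(chain, min_genes):
    """True if the chain has at least min_genes distinct genes on each chromosome."""
    genes_a = {a for a, _ in chain}
    genes_b = {b for _, b in chain}
    return len(genes_a) >= min_genes and len(genes_b) >= min_genes


def candidate_blocks(matches, max_gap=5, min_genes=3):
    """Group protein matches into candidate duplicated blocks.

    matches: iterable of (chrom_a, index_a, chrom_b, index_b), where each
    index is a gene-order position along its chromosome.
    max_gap is an illustrative proximity threshold, not a value from the paper.
    """
    by_pair = defaultdict(list)
    for chrom_a, idx_a, chrom_b, idx_b in matches:
        if chrom_a == chrom_b:
            continue  # only matches between two different chromosomes
        if chrom_a > chrom_b:  # store each chromosome pair in one orientation
            chrom_a, idx_a, chrom_b, idx_b = chrom_b, idx_b, chrom_a, idx_a
        by_pair[(chrom_a, chrom_b)].append((idx_a, idx_b))

    blocks = []
    for pair, hits in by_pair.items():
        hits.sort()
        chain = [hits[0]]
        for hit in hits[1:]:
            prev = chain[-1]
            close = hit[0] - prev[0] <= max_gap and abs(hit[1] - prev[1]) <= max_gap
            if close:
                chain.append(hit)
            else:
                if _big_enough(chain, min_genes):
                    blocks.append((pair, chain))
                chain = [hit]
        if _big_enough(chain, min_genes):
            blocks.append((pair, chain))
    return blocks


# Toy input: three nearby matches form a block; two isolated ones do not.
toy_matches = [
    ("18", 1, "20", 10), ("18", 2, "20", 11), ("18", 4, "20", 13),
    ("18", 50, "20", 90), ("18", 51, "20", 91),
]
toy_blocks = candidate_blocks(toy_matches)
print(toy_blocks)
```

The greedy chaining used here is the simplest choice that respects the stated criteria; a production method would also need the false-positive filters described in the text.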



:somes, "A and .^whether !ithisiheterogeneity,:is^^^^^ 

^^eater ; than expected; by ^chance: ' if SNPsV^iicie6tide!:(dive^ and ^ 

.occur by 'random ;and independent mutations; ■'•i across the entire genome and found that tbe'*^\1SNPsV•respe^^ 

then it would seem that there ought to be;a- :]/:|:c6rrelation betwieen them was positive (r,=Yt;7tein; ch^ even smaller l&ac- 

Poisson . distribution of numbers of SNPs . in y^^^^^^ lughlyj.sigriificSit' (P;,;< in 

fragments of arbitrary constant size/.T^ 
^served dispersion in !the distribution ^ofSNPs^^^ 
■in'do(Hd3p;!fiagments vwasvfkr^g^reatCTrthairr^ 

^predicted tftom :a-;Poiss6n;;yistributiori ' (Fig?f^^ \iy^^^i}^:^v^ 
14). ;Howeyer,^this siinplisticimodel ignores. {;§^>T^ 
;the,different recombinatibn rates ^^^^ 
:tion Mstories. that exist. in different 
the genome. Population genetics thebryj 
^that.we can, account for.this variation with V-^^OT silent),- in-/-::v tates.Were od 

■ mathematrcal fonnulation called ,tiie neutral ^ ■'.vtrbhic, and > 3 VUTR ^Tpr ^vl 0,239 >?laip wii .vi^;^oie?ddbite^ in* exons; : than m 

vcoalescent (i(?P)- ^Apptying well-tested algo^ ' NQBhRefSeg . da^v:^^introris^ ;and ^in v^)<iagem^ ixi m- 

rrithrns /or simulating •the;;neutfai".b^ hiiman-genes p^^ v'Strons (Vdj^^Many .of^^^ will'^ 

■with recombination (770),; and using. an :ef-::^^_the\Celerai<Otto'-^^ 

fectiye population size of .10,000 ;and )& per-:fe Vgions, ,SOT ;>vere;, categonz^ as.eh -CvinarkOT^^^fo^^^ and 
J base recombination rate .equal to the mutation : :7 lent,r fof -iWse 'that**<i6' nbrd^^^ 
< rate (777), we generated a distribution of nun> <v.: acid sequ^ . ;^:fimction as well- ^^.-1 :.:^-u.^:: 

bers of SNPs by this model as well (772). The ; change - the' protein -product. .The :ratio of * ; . " : - > ■ v 

observed distribution of SNPs has a much larg^^-^kmissense >t6. silent coding > SNPs in ^ Gelera-^-^^r-. 7^ An Gyen^iew.o^^^^^ iPredicted ^ . . 
: er variance than either the Poisson model or the : : PFP,^TS 

xoalescent model, and the difference is higWy : ;0.^ . Genome /- . : ' ^ . 

significant This implies tiiat there is significant .; ; duced frequency of missense variants 'corii^ ^y Summary:-. Tliis .secti^ ^an^. initial 

' Variability across the genome in SNP density, v- ;-;^^ with the iieutral expectation,'- consis-' . . . cbmjputatibnai ;'&alysis ;: of • 'tiie ^predicted 

an observation that begs an explanation. . tent with the elimination by natural selec- protein 'set with... the aim. . of cataloging 

Several attributes of the DNA sequence tion of a firaction-'of the deleterious aniino :. v proniinent . . differences .Van^ '-^^imilanti^^ 

may. affect the local 'density of SNPsJ :in-5> acid .dh^ are* coin-' ^ :^heh t 

; eluding the rate at.which DN A: polymerase, t; ^: parable;. to vt^ of ;/w'6ther^lly^^^^^^ 

' makes errors and the efficacy of mismatch ; 0.88 krid 1^7 found by CargiU i/ c7. <707) ? ;:iGveir^ 40%Sbf; t^ in 



repair. One key factor that is likely to be
associated with SNP density is the G+C
content, in part because methylated cytosines
in CpG dinucleotides tend to undergo
deamination to form thymine, accounting
for a nearly 10-fold increase in the
mutation rate of CpGs over other dinucle-





and by Halushka et al. (102). Similar results
were observed in SNPs derived from
Celera shotgun sequences (46).

It is striking how small is the fraction of
SNPs that lead to potentially dysfunctional
alterations in proteins. In the 10,239 RefSeq
genes, missense SNPs were only about
Fig. 14. SNP density in each 100-kbp interval as determined with Celera-PFP SNPs. The color codes
are as follows: black, Celera-PFP SNP density; blue, coalescent model; and red, Poisson distribution.
The figure shows that the distribution of SNPs along the genome is nonrandom and is not entirely
accounted for by a coalescent model of regional history.
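
One simple way to see the departure from a Poisson expectation described in this caption is the variance-to-mean ratio of per-window SNP counts, which is about 1 for Poisson-distributed counts and larger when counts are overdispersed. The sketch below uses toy window counts invented for illustration; it is not the authors' test, and the function name is an assumption.

```python
from statistics import mean, pvariance


def dispersion_index(counts):
    """Variance-to-mean ratio of per-window SNP counts (about 1 under Poisson)."""
    return pvariance(counts) / mean(counts)


# Toy per-window counts (not data from the paper).
even_windows = [2, 2, 2, 2]      # no spread at all
clumped_windows = [0, 0, 8, 0]   # same mean, SNPs concentrated in one window

print(dispersion_index(even_windows))
print(dispersion_index(clumped_windows))
```

Both toy sets have the same mean, so the difference in the index reflects only how unevenly the SNPs are spread across windows.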



humans cannot be ascribed a molecular
function by methods that assign proteins to
known families. A protein domain-based
analysis provides a detailed catalog of the
prominent differences in the human genome
when compared with the fly and
worm genomes. Prominent among these are
domain expansions in proteins involved in
developmental regulation and in cellular
processes such as neuronal function, hemostasis,
acquired immune response, and cytoskeletal
complexity. The final enumeration
of protein families and details of protein
structure will rely on additional experimental
work and comprehensive manual
curation.

A preliminary analysis of the predicted human
protein-coding genes was conducted.
Two methods were used to analyse and classify
the molecular functions of 26,588 predicted
proteins that represent 26,383 gene
predictions with at least two lines of evidence
as described above. The first method was
based on an analysis at the level of protein
families, with both the publicly available
Pfam database (114, 115) and Celera's Panther
Classification (CPC) (Fig. 15) (116).
The second method was based on an analysis
at the level of protein domains, with both the
Pfam and SMART databases (115, 117).

The results presented here are preliminary,
and are subject to several limitations.
Both the gene predictions and functional
assignments have been made by using computational
tools, although the statistical
models in Panther, Pfam, and SMART have
r'^^i^^^y ?^^ ! and reviewed by ex- 
tt^P^^ fe??fe'Sists.|[njth^e set of computationally 
1 irprecHcted ge^^^ both false-positive 

. * predictions; (some 6^ these may in fact be inac- 

•^'^^•:^^fS^^^ predic- 
7:^"^?^Xsop^?;li^ not be coniputa- ■ 

i^^|iP?^^y-P^ enbis in V 

jhelbq|un exons and genes. 

assign-^ 

f signment^ families 
^ that tend to be^fol^^ across several organisms, 
; br on families of known human genes. There-. 
,;: fpre;^e:dp not function to many genes 

. : f^ltiffi^'ji^^iH^ even if the fimc-. :'. 

'tidh is IcnownJ'jlJiJe^^^^^d^ all . 

, enumeration of the geiies in any given family or 
>: furi from the set of ^ 

26,588 predicted proteins, which were assigned
functions by using statistical score cutoffs
defined for models in Panther, Pfam, and
SMART.

For this initial examination of the predicted
human protein set, three broad questions
were asked: (i) What are the likely
molecular functions of the predicted gene
products, and how are these proteins categorized
with current classification methods?
(ii) What are the core functions that
appear to be common across the animals?
(iii) How does the human protein complement
differ from that of other sequenced
eukaryotes?
7.1 Molecular functions of predicted
human proteins



Figure ^,shows , an . oye^ 
; tive Tinolecukr^iunctions/p^^ 
26,588 ihunian, proteins ^ti^ 
;tv^ ^lines;:of ■ supporting';;!^ 

;v 41^/(12,809),it)f >thejgehe^ 

/not J5e,classified;;fipm;^^ 

: and , are > termed;; /Jjroteins^^^ 
-.functions. Becajise/oo^ 

: cation methpds'treat;^^^ 

> protein :^niilies/>^ere|are^ 

. "unclassified" .sequences^tha^^ 

; have a known or predicted :function!^ 

: ^60% of the; protein ;spt'that^have^ 
: functional predictioil^,;^tie;^]ie^ 
functions have.; been;; piaced^jri^ 
^classes. :\Ve focus , here cm^ 
tion (rather than: higher ;0i3ef'c^:iru^ 

, cesses) in order to;classify' as many proteins. I , 
as possible.;;These;fimctionarpj^^ 
are.^based on ^similarity};^ ^s;equ^nces/ofO 
known function. ;. 

In our analysis of the 12,731 additional low-confidence
predicted genes (those with
one piece of supporting evidence), 636
(5%) of these additional putative genes were
assigned molecular functions by the automated
methods. One-third of these 636 predicted
genes represented endogenous retroviral proteins,
further suggesting that
these unknown-function genes are not real
genes. Given that most of these additional
12,095 genes appear to be unique among the
:^g^cp^s. sequenced 1^ simply 
.^represent false-posi^ 

^:^;;^f^^e mostco are; 
t^,J|ieiranscr^ in 
;;.nuc)eic:^cid n^^^ enzyme). 
fePtha^imi^^ : ■ 

K:;^and livdrolases^^^ the 



?-^h>^6lases^a^ rnany 
^:ti:prpteins;:that-iare;mfe^^ . 
j;,femili^ 

::V;^6ry proteins mvolved iii specif^: * 

' ti^;steps of signal rtransductioh . such as hetero- *' 
; trimeric QTPrhin^g proteins (G proteins) and ■ 
|cell;cycle;regidatore;and (n)prote mod-' ' 
|i^?*?:Ae :acti^ 

^phosphiateses:^:!:^^ ■■ 



^able7;l7^I Distribution ';6HSNPl^y^^ of ^ 

-genomic^ regions. '''D^■;vVl^■•■^^V/'^■^^^^^ V 



Genomic region 
; class 



/*;■^;■ ^region 
examined ; 



;:'V^.Celera-PFP 
\ SNP 
density 

j; (SNP/Mb) 



Intergehic ; '--l-r 
Gene (introh + 

exon) 
Intron i - 

First intf Off ;; -^;^ 

First -exdn^^^T^^ 



646 : . 
;■ 615' 

■::;^i64jj: 

;^v3i''^::; 



1^707 
917 

; i921 

W529 



[Fig. 15 chart labels. Outer circle: GO categories; inner circle: Panther categories. Slice labels as printed:]

cell adhesion (577, 1.9%)
miscellaneous (1318, 4.3%)
chaperone (159, 0.5%)
viral protein (100, 0.3%)
transfer/carrier protein (203, 0.7%)
nucleic acid enzyme (2308, 7.5%)
signaling molecule (376, 1.2%)
receptor (1543, 5.0%)
kinase (868, 2.8%)
select regulatory molecule (988, 3.2%)
transferase (610, 2.0%)
synthase and synthetase (313, 1.0%)
oxidoreductase (656, 2.1%)
lyase (117, 0.4%)
ligase (56, 0.2%)
isomerase (163)
hydrolase (1227, 4.0%)
cytoskeletal structural protein (876, 2.8%)
extracellular matrix (437, 1.4%)
immunoglobulin (264, 0.9%)
ion channel (406, 1.3%)
motor (376, 1.2%)
structural protein of muscle (296, 1.0%)
protooncogene (902, 2.9%)
select calcium binding protein (34, 0.1%)
intracellular transporter (350, 1.1%)
transporter (533, 1.7%)
molecular function unknown (12809, 41.7%)
Fig. 15. Distribution of the molecular
functions of 26,383 human genes. Each
slice lists the numbers and percentages
(in parentheses) of human gene functions
assigned to a given category of molecular
function. The outer circle shows the
assignment to molecular function categories
in the Gene Ontology (GO) (179), and the
inner circle shows the assignment to
Celera's Panther molecular function
categories (116).
7.2 Evolutionary conservation of core
processes



.^iV,-- . 

^v' rCZ-^^j-vwe Jdehtified . two /different /cases, *for;^ 
Weach ^pairwise xomparison : (humari-fly / and;? 

■ Because ':0f ,thV\vanou5 ..^ ^-first. case., was 'a pair iofv; 
I'', gehome-sequencin'g"'; proj^ct^ - "that have al-. ; v.' genes, , one from ■ each organism, ? for , which / 
•■. ready been cdnipleteH,; reasonable c6mpara-| v' there was -no :other, close -homqlog in either. 

tive inforinafion is av^^^^ are straightforwardly idehti- v 

an'alysis pf itfie Jevolutio^^^^^ ;as ^orth'ologpus, -because xthere ;are':no :f 

->V nome.^^The ^genomes3p^^^ ^ceri^^ Jivadditional members of the 'faimlies t^ 

!\efs* /ypasC^ rauad ;tw^ 

■ brates,; C.g/e^a/w 

v":"and D, melan6gdsteKXfiyy(2^^^ as well as tiie >more than one member in either of both of the 
>:::fu:st plant genome,;^.^/^^^^^^ com- ? .f: .orgamsps .bemg ' compared/; Cheryitz ^et '^dl:;}r 

';;^;pleted'(P2),prpy^^^^^ (72^?) ;deai..with.:this case.; by.'-analyzmg .-aj.. 

/ ygencbie coni^ariso)^ tree that described the xelation- v , 

: - . ■•r We enumerated the of the' sequences ^in both i; 

* ■ served between hijma^^^^^^ and^ then looked for pairis of genes j,' 

^•J; human^^and •wonnli^g.5i6)\^o;^adto nearest, neighbors in the,tree;'If the^j^^ 

';v; question, - -What " a^^ -'functions .-iat /v^ nearek^neighbor.^^ pairs., were - from different .> h 

'appear to J)e*x^^^ those gbnes were presumed, to be^-;:' 

' The ' concept of ;,6i&blogyj is/mpor^ -be- orthologs J We :note that these nearest neigh- 
. ■ cause if two genes are ortholbjgs, they can be / v bors!can often be confidently identified from . 
traced by descent to the common ancestor of v ^paiiwise sequence comparison ; without hay--; .4 

^. ;*ing*.to examine a phylogeneticvtree ,(see leg- ' 
• end to ;Fig. 16). If the .nearest neighbors .are , : 
vnot from different organisms, there, has been r . 



^^;sidei::only^si^^ tibe proteins 

f;with .^unambiguojbs^^ relationships 

^^(?igS;19-;^y?^i^^^^ are 2758 

^^st^cf}^hlm^ar^^^ ;'human- ^. 
/^wbmX1523.;in;:^^^ 
::>We;yefine:&^^^^ 



elegarisj^ 



the two organisms (an "evolutionarily conserved
protein set"), and therefore are likely
to perform similar conserved functions in the
different organisms. It is critical in this analysis
to separate orthologs (a gene that appears
in two organisms by descent from a common
ancestor) from paralogs (a gene that appears
in more than one copy in a given organism by


a paralogous expansion in one or both organisms
after the speciation event (and/or a gene
loss by one organism). When this one-to-one
correspondence is lost, defining an ortholog
becomes ambiguous. For our initial compu-

a duplication eyent^^becau^^^^ 
, subsequently diverge, in fiinction,' Following ^hV.tein set, we could hot answer.this question for : •- 
the .* yeast-wpnnXofthplo^^^ protein. - Therefore, -we con- 



: suiprisirigly^ (the, set., of xonseryed .proteins' is 
^. not /distribiited. am 
:;,the 'same^ay £s: e wh(^e 
> Gompared ;.with;^:th^ \ set .;(!; ig. / • 

; 1 5),>,there are sevi^^ 

• represented iii^tea^ abactor of > 
:'j;?T^';pr.inoreV^ acid i 
;'€n^jTOes,;-pira ma- 
^chinery'-'f (ho1abjy\fpN . methyitrans- . 
;Vferas.es;0NA/RNA^ • helicases, ..i;' 
i DNA ligases, ^NAt \ and/v-M^L^ ; 
; factors, : nucleases,/ ^d.ribosomal; proteins). 
The basic transcriptional and translational machinery is well known to
have been conserved over evolution, from bacteria through to the most
complex eukaryotes. Many ribonucleoproteins involved in RNA splicing
also appear to be conserved.

Other enzyme types are also overrepresented (transferases,
oxidoreductases, lyases, and isomerases). Many of these en-



Fig. 16. Functions of putative orthologs across vertebrate and
invertebrate genomes. Each slice lists the number and percentages (in
parentheses) of "strict orthologs" between the human, fly, and worm
genomes involved in a given category of molecular function. "Strict
orthologs" are defined here as bi-directional BLAST best hits (180) such
that each orthologous pair (i) has a BLASTP P-value of ≤10^-10 (120),
and (ii) has a more significant BLASTP score than any paralogs in either
organism, i.e., there has likely been no duplication subsequent to
speciation that might make the orthology ambiguous. This measure is
quite strict and is a lower bound on the number of orthologs. By these
criteria, there are 2758 strict human-fly orthologs, and 2031
human-worm orthologs (1523 in common between these sets).
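
The two criteria in the legend above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' pipeline: the hit tuples, the names `best_hits` and `strict_orthologs`, the toy gene identifiers, and the toy P-values are all hypothetical, and a smaller P-value is used here as a stand-in for "a more significant BLASTP score".

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical hit record: (query gene, subject gene, BLASTP P-value).
Hit = Tuple[str, str, float]


def best_hits(hits: Iterable[Hit]) -> Dict[str, Tuple[str, float]]:
    """For each query gene keep the subject with the smallest P-value."""
    best: Dict[str, Tuple[str, float]] = {}
    for query, subject, p_value in hits:
        if query not in best or p_value < best[query][1]:
            best[query] = (subject, p_value)
    return best


def strict_orthologs(a_vs_b: List[Hit], b_vs_a: List[Hit],
                     a_vs_a: List[Hit], b_vs_b: List[Hit],
                     max_p: float) -> List[Tuple[str, str]]:
    """Bi-directional best hits that pass the cutoff and beat any paralog."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    # Best within-genome hit for each gene, ignoring self-matches.
    paralog_a = best_hits([h for h in a_vs_a if h[0] != h[1]])
    paralog_b = best_hits([h for h in b_vs_b if h[0] != h[1]])

    pairs = []
    for gene_a, (gene_b, p_ab) in best_ab.items():
        back = best_ba.get(gene_b)
        if back is None or back[0] != gene_a:
            continue  # not a bi-directional best hit
        pair_p = max(p_ab, back[1])  # the weaker of the two directions
        if pair_p > max_p:
            continue  # criterion (i): P-value cutoff
        # Criterion (ii): no paralog may match as well as the ortholog.
        if gene_a in paralog_a and paralog_a[gene_a][1] <= pair_p:
            continue
        if gene_b in paralog_b and paralog_b[gene_b][1] <= pair_p:
            continue
        pairs.append((gene_a, gene_b))
    return sorted(pairs)


# Toy data (invented for illustration only).
HUMAN_VS_FLY = [("h1", "f1", 1e-50), ("h1", "f2", 1e-20),
                ("h2", "f2", 1e-30), ("h3", "f3", 1e-5),
                ("h4", "f4", 1e-40)]
FLY_VS_HUMAN = [("f1", "h1", 1e-48), ("f2", "h2", 1e-30),
                ("f3", "h3", 1e-5), ("f4", "h4", 1e-40)]
HUMAN_VS_HUMAN = [("h4", "h5", 1e-60), ("h1", "h2", 1e-12)]
FLY_VS_FLY: List[Hit] = []

PAIRS = strict_orthologs(HUMAN_VS_FLY, FLY_VS_HUMAN,
                         HUMAN_VS_HUMAN, FLY_VS_FLY, max_p=1e-10)
```

In the toy data, h3/f3 fails the P-value cutoff and h4/f4 is rejected because h4 has a closer within-genome match (h5), which is the situation the legend describes as a duplication that makes orthology ambiguous.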



cytoskeletal structural protein (20, 1.2%)
chaperone (16, 0.9%)
cell adhesion (11, 0.6%)
miscellaneous (72, 4.2%)
viral protein (4, 0.2%)
transfer/carrier protein (11, 0.6%)
transcription factor (81, 4.7%)
nucleic acid enzyme (221, 12.9%)
receptor (23, 1.3%)
kinase (69, 4.0%)
select regulatory molecule (88, 5.1%)
transferase (70, 4.1%)
synthase and synthetase (64, 3.7%)
oxidoreductase (64, 3.7%)
extracellular matrix (12, 0.7%)
ion channel (7, 0.4%)
motor (13, 0.8%)
structural protein of muscle (8, 0.5%)
proto-oncogene (23, 1.3%)
intracellular transporter (51, 3.0%)
transporter (44, 2.6%)
molecular function unknown (613, 35.8%)
lyase (12, 0.7%)
ligase (9, 0.5%)
hydrolase (80, 4.7%)
isomerase (21, 1.2%)
zymes are involved in intermediary metabolism. The only exception is the
hydrolase category, which is not significantly overrepresented in the
shared protein set. Proteases form the largest part of this category, and



The Human genome 



in development and homeostasis; (iv) hemostasis; and (v) apoptosis.
..• Acqui«d.:Jbimiinity;-Ki^v.o^* t4 ^A^ti' . P^cesses of rieura! dwelop. 

:.strildnJ.dif^nc.s,S^ 

: .m each of Aese.three.prgamstns^^ 



-cludm^:.ADP.nbosylation::fectoO-a^^^ 
.^c^Ie^glJatorsXp^ticulaaythe^^ 

ily,^yc in Ciamily;.aii^ 
rproteinkmases).W^ 

■oyerrepwd categories are:pi^te^ 
iP°rt,and;trafficJdng,^^^ 
n^stc^r«eryedgrptq,siritheseckte^^^^ 
:Pf?t^«^i5vqlyed;in^ated^ 

jansport,^ chaperonesM ; 
folding and heat-shock resppHse Qiarticularly^V 

the :DNAJ >fteyiy;:;&d^heat:sh^i?proteiri ■^ SfiP showii to-regulate ii^u^SaS 

60 (HSP60). HSP70. HSP90^filiS - 'CSS SJsS^^^ 
ITiese observatioris providepnlyi;conserVi'^^ 

estirmte of:the;p^tem;;femili^ 
context of , specific :celluiar* processes that' tranTduSn fo,^^^^^ ' and gliogenesis (iitf) 

were.likely derived fh,m the' ast! common ^ ^S^^^StS^'^T '^^'^ Other human expailded gene families play 
ancestor pfthe;human, flj^ and , wo^^ : featuS^SS'^S^Sl^ Vi'J^^^^^^ 

found in the signal transducer and activator of transcription (STATs),
the suppressors of cytokine signaling (SOCS), and protein inhibi-



As stated before, this analysis does not provide a complete estimate of
conservation across the three animal genomes, as paralogous duplication
makes the determination of true orthologs difficult within the members
of conserved protein families.

7.3 Differences between the human
genome and other sequenced
eukaryotic genomes



To explore the molecular building blocks of the vertebrate taxon, we have
compared the human genome with the other sequenced eukaryotic genomes at
three levels: molecular functions, protein families, and protein domains.

Molecular differences can be correlated with phenotypic differences to
begin to reveal the developmental and cellular processes that are unique
to the vertebrates. Tables 18 and 19 display a comparison among all
sequenced eukaryotic genomes, over selected protein/domain families
(defined by sequence similarity, e.g., the serine-threonine protein ki-



- . r» , ' ~ "-"'V, ."1. Willi a:) I, 

many of the animal-specific protein domains that play a role in innate
immune response, such as the Toll receptors, do not appear to be
significantly expanded in the human genome.
Neural development, structure, and



panded more than twofold in humans relative to the invertebrates),
originally found to regulate synaptic transmission by serving as a
Ca2+ sensor (or



tors:of.activatS:stATV Vms^^S^ - ^^'-V^^^' receptp9Vduring:^ynaptic 



nases) and superfamilies (defined by molecular function, which may
include several sequence-related families, e.g., the cytokines). In these
tables we have focused on (super) families that are either very large or
that differ significantly in humans compared with the other sequenced
eukaryote genomes. We have found that the most prominent human
expansions are in proteins involved in (i) acquired immune functions;
(ii) neural development, structure, and functions; (iii) intercellular
and intracellular signaling pathways



With the worm and fly genomes, there is a marked increase in the number
of members of protein families that are involved in neural development.
Examples include neurotrophic factors such as ependymin, nerve growth
factor, and signaling molecules such as semaphorins, as well as the
number of proteins involved directly in neural structure and function
such as myelin proteins, voltage-gated ion channels, and synaptic
proteins such as synaptotagmin. These observations correlate well with
the known phenotypic differences between the



the increase in the number and connectivity of neurons; (ii) the increase
in number of distinct neural cell types (as many as a thousand or more in
human compared with a few hundred in fly and worm) (121); (iii) the
increased length of individual axons; and (iv) the significant increase
in glial cell number, especially the appearance of myelinating glial
cells, which are electrically inert supporting cells differentiated from
the same stem cells as neurons. A number



the increased co-occurrence in humans of PDZ and the SH3 domains in
neuronal-specific adaptor molecules; examples include proteins that
likely modulate channel activity at synaptic junctions (125). We also
noted expansions in several ion-channel families (Table 19), including
the EAG subfamily
(related to cyclic nucleotide-gated channels), the voltage-gated
calcium/sodium channel family, the inward-rectifier potassium channel
family, and the voltage-gated potassium
channel, alpha subunit family. Voltage-gated sodium and potassium
channels are involved in the generation of action potentials in neurons.
Together with voltage-gated calcium channels, they also play a key role
in coupling action potentials to neurotransmitter release, in the
development of neurites, and in short-term memory. The recent
observation of a calcium-regulated association between sodium channels
and synaptotagmin may have consequences for the establishment and
regulation of neuronal excitability (129).

Myelin basic protein and myelin-associated glycoprotein are major classes
of protein components in both the central and peripheral nervous system
of vertebrates. Myelin P0 is a major component of peripheral myelin, and
myelin proteolipid and myelin oligodendrocyte glycoprotein are found in
the central nervous system. Mutations in any of these
Table 18. Domain-based comparative analysis of proteins in H. sapiens (H),
D. melanogaster (F), C. elegans (W), S. cerevisiae (Y), and A. thaliana
(A). The predicted protein set of each of the above eukaryotic organisms
was analyzed with Pfam version 5.5 using E value cutoffs of 0.001. The
number of proteins containing the specified Pfam domains as well as the
total number of domains (in parentheses) are shown in each column.
Domains were categorized into cellular processes for presentation. Some
domains (e.g., SH2) are listed in more than one cellular process. Results
of the Pfam analysis may differ from
.y^r6L5ults.obtained based;pn;hu ^^^^ 
imitations of ,large-scale.:au^ 
of dorpains with reduced.count^^^^ 
this analysis are marked with a double asterisk (**). Examples include
short divergent and predominantly alpha-helical domains, and certain
classes of cysteine-rich zinc finger proteins.



S. Accession 
- number 



• Domain name 



i^:::s,j£XDmain description JkiliiLi-^f^^f:;'^^-^^^ V^/M-P'^^^^ ■ . vV:^W v^^^Sy 



r^'V PF02039; 
.: . ;PF00212 
^'";^PF00028 
..:',PF00214. 

.■pFomo 
•;pF0i693 

>F60029 ■ 
, , ;. PF00976 
• HPF00473 : 
' PF00007 
PF00778 
..>F00322 
PF00812 
PF01404 
PF00167 
PF01534 
PF00235 
PF01153 
PF01271 
. PF02058 
PF00049 
PF00219 
PF02024 
PF00193 
PFO0243 
.PF02158 
PF06l84 
. PF02070 
PF00066 
PF00865 
PF00159 
PF01279 
PF00123 
PFd0341 
PF01403 
PF01033 . 
. PFOOI63 
PF02208 
PF02404 
PF01034 
PF00020 
PF00019 
PF01099 
PF01160 
PF00110 

PF01821 
PF00386 
PF00200 
PF00754 
PF01410 
.PF00039 
PF00040 
PF6005r 
PF01823 
PF00354 
PF00277 
PF00084 
PF02210 
PF01108 
PF00868 
PF00927 



■ ' ^Adrenomedullin ' 

. ^ TCadherin 0 ' . 
. Calc.CGRPJAPP 
: XNTF M 
^'.Clusterin \ y 
'Connexin . 

V . ACTH^domain *. 

' Cys_knot * 
*. P'X , 
. Endothelin \> 

Ephrin 

EPhJbd 
■ FCF 
• Frizzled 

Hormone6 

Clypican 

Granin . ; 
- Cuanylin, - _ ; 

Insulin 

ICFBP 

Leptin 

X)ink 
:rNGF . . :; . 
., .:Neuregulin . .„ . 
: Hormones 

NMU 

Notch 

Osteopontin 
Hormone3 
Parathyroid 
H6rmone2 
: PDGF, - - 

Sema 

Somatomedin_B 

Hormone 

Sorb 

SCF 

Syndecan 

TNFiLce 

TCF-p 

Uteroglobin 

Opiods neuropep 

Wnt 

ANATO 
Clq 

Disintegrin 

F5_F8_type_C 

COLFI 

Fnl - . 

Fn2 

Kringle 

MACPF 

Pentaxin 

SAA_proteins , 

Sushi c 

TSPN 

Tissue_fac 

TransglutamIn_N 

Transglutamin^C 



'•. . c" ■ ■vV;::^J;\IDev'e/^^ homeostattc 
Adrenomedullin . /T/., . . .^r^^^.V: -i^;,. 
b-; \ Atrial natriuretic peptide v.:i>. - • > ; , . > 
-7 -Cadherin 'domain !• ri"^ •iC5^ n;n.^^?ai?:;^'n-A^fv 
;.y;pldtoriin/CGRP/IAPP;famil^^ • : . ; - 

; Ciliary neurotrophic factor , • - ^ - - , . - ; 
: Clusterin ; ^ . - : . : V ^ 

Connexin .'■'•}' / , . . . ^ ; ; . 

^; Corticotropin' AGTH domain 
rjf Corticotropiri-releasing fartoir^amily^" T--! J-^ 
Cystine-knot domain . ' . . - - . 

Dixdomain ^ J . ,/ - . ^ a\ • 

•-;Ehdothelin family J " ■ ' 
Ephrin 

Ephrin receptor llgand binding domain 
; Fibroblast growth factor 

., -Frizzled/Smoothened family membrane region V 
Glycoprotein hormones 
Glypican 

: Grainin (chromogranin or secretogranin) - 

Cuanylin precursor 

Insulin/IGF/Relaxin family 

Insulin-like grovrth factor binding proteins 

Leptin . . 

LINK (hyaiuron binding) > 
; Nerve growth factor family ' r, . 

Neuregulin family - -V ; - - 
" Neurohypophysial hormones ^. '*/ • . ^ y ''^''^'P-^^\' 

Neuromedin U - 

Notch (DSL) domain . 

Osteopontin 

Pancreatic hormone peptides 
- Parathyroid hormone family 
Peptide hormone 

J Platelet-derived growth factor (PDGF) "" 
Sema domain 
Somatomedin B domain 
Somatotropin 

Sorbin homologous domain 
Stem cell factor 
Syndecan domain 
TNFR/NGFR cysteine-rich region 
Transfonming growth factor p-like domain 
Uteroglobin family . 
Vertebrate endogenous opioids neuropeptide 
Wnt family of developmental signaling proteins 

Hemostasis 

Anaphylotoxin-like domain 
Clq domain 
Disintegrin 

F5/8 type C domain ' ' ' 

Fibrillar collagen Crtermlnal domain 

Fibrpnectih type I domain : 

Fibroriectin type II domain ; ■ . 

Kringle domain • . 

MAC/Perforin donrtain 

Pentaxin family 

Serum amyloid A protein 

Sushi domain (SCR repeat) 

Thrombospondin N-terminal-like domains ' , 
Tissue factor " . 

Transglutaminase family " " 

Transglutaminase family 



'regulators ;^ , ' V 
■ : . 100(550) 

........ 

- : .''4(16) ^.:: 
1 

2 

10(11) 

7(8) 

' 12 
- : . -23. 
^. 9 
1 
14 
3 
1 
7 
10 
1 

,13(23) 

>.„.;3 
."".'4 
1 . 
1 

3(5) 
1 
3 

5(9) 

■^5 

27(29) 
5(8) 
1 

' 2 
2 

17(31) 
27(28) 
3 
3 
18 



;f/:.:^:-0,- 
14 (157) i 

■.^^-:^^-0.;' 
:^ -0 • ' 

^"•■-r 0, •' 
iP J 

;vv. .^2-v 

. '0 ■ 

2 ., 

7 
0 
2 
0 

0 " 
' 4 . 
0 

.0 

, 0 ■' r 

- .0 
. 0 
0 

.2(4) . 
0 
0 
0 

. 0 - 

8(10) 
3 
0 
0 
0 
1 
1 

6 . 

0 

0 

7(10) 



: .0 

*16(66) \ 

.■.■,..0 .-^r^'!?.: 
O.i:-^- 
' ■ --O 

. ::o:.' 

,-„,:,^.0.;4-.^ 
:ov'^:^- 
.0 - 

'r--- 4 ' 

0 

•'1 ■■■ 
■■ ■'■-1 

. ■ 3 
0 

1 . . 

0 
0 
0 
0 
0 

0 
6 
0 
0 

2(6) 
0 
0 
0 

- _.o 

0 

3(4) 
0 
0 
0 
0 

1 

0 
4 
0 
0 
5 



6(14) \ 


.0 


24 


0 


18 . 


2 


15(20) 


5(6) 


.10 




5(18) V 


: 0 


■11(16) 




15(24) 


■ - 2 ■ ■ 


6 


. 0 


9. 


.0 , 


4 




53(191) . 


11(42) , 


, - 14 ■ r 


.; ..1 


1"** i 


0 


6 


1 


8 


1 



0 
0 
3 
2 
0 

' 0 
0 
2 
0 
0 

^ 0 
8(45) 

:! 0 
0 
0 
0 



Jo- 
OA 

.Av. O 

0 

Ao ■ ■ 

0 
0 

. 0 

:0 

0 
b 
* 0 
0 

r 0 
■ 0 
0 
0 
0 
0 
. -0 
tCi- 
= 0 
0: 
0 
0 
0 
0 
0 
0 . 
0 
0 
0 
0. 
0 
0 
0 
0 
0 
0 
0 
0 



0 
0 
0 
0 
0 
0 
0 
0 
0 
0 

0 ■ 

0 
0 
0 
0 
0 



,v??o^ 
v^.^o: 

- > 0- 

' ^^0 ^ 
• 0 

• 0 

Q 

'0 ' 

0 

0 

0 

0 

^0 
0 
0 

:0 
0: 0 
TO 

6 

0 
0 
0 
0 
0 
0 

0 
0 
0 
0 
0 
0 
0 
0 

* 0 
0 
0 

0 
0 
0 
0 
0 
0 
0 
0 
0 
0 

. 0 ' - 

b 

0 
0 
0 
0 
Table 18 (Continued)






Accession 

ri,;;;-nurriber/'yi 



^ ^..PF00711 - 
:PF00748 ; 

v ?F00129 



.r;PF00993 
■ >F00969; 
; : PF00879 
' PF01109/ 
PF6o047 
PF00143 
,PFdb714 
PF0b7Z6 ■ 
.PF02B72 . 
-fiPF06715 : 
-PF06727 
PF02025 
; PF01415 
PF00340 
; PF02394 
PF02059 
PF00489 
PF0i291 

PF00323 
PF01d91 
PF00277 
PF0004a 

PF01582 
• PF00229 

PFOoqss 

PF00779 
PF00168 
PF00609 
PF00781 
PF00610 

PF01363 
PFt)(@9r' 
PF00503 
PF00631 
. PF00616 
PF00618 

PF00625 
PF02189 
PF00169 
PF00130 

PF00388 

; PF00387 

PF00640 
: PF02192 

PF00794 

PF01412 ' 

PF02196 
; PF02145 

PF00788 
; PF00071 
■ PFOO6I7' 
[ PF00615'/ 
'PF02197 



Sv^'GMjCSF 

; Interferon 
; IFN-gamma 
IL10 

lUS/ 

:r\l2 

..■■■|L5.-'^'-:"^ :^ 

\i\ 

IL1_propep 
IL3 

LIF^OSM 

Defensrns 
PTN_MK 
SAA4)roteIns 
IL8 

TIR V 
TNF 
Trefoil 



^^Defehsin^beta^^ ^^^k^Bete &ef«^^ 

' ' > ^k;::'ca!pS^^inhjbitS^ ■ 

■ - r^^^ 

• Interferon alpha'^^^ . 
Interferon gamma 
lnterleukln-10 : ' - 
j lnterleukin-15 -t'^-;^:'^ h 

:^r-y lnterleukrn-2- ■:t^fe^:~v-;:^::-:r- v^;:-:::-^^^ 
' : ^;lnterteukin-4 V ; - V - > ■ ; 

, ..k lnterleukin-5 ' ir^ . " . . : ■ ; 
-J ^ lnterleukin-7y^ family • ' ■ : • ^ T /n;: 
Interleukfn-I \. ■ ■ ' 

- lnterleukin-1 propeptide 

■ jnterteLikin-3 ' ■ ^r^^^t-^^v^ ;;: 

lnterleukln-6/G-CS>/MGF family . 
*"^famH^^ factor (LIF)/oncostatin (OSM) 

Mammalian defensin 

PTN/MK heparin-binding protein V ' 

Seriim amyloid A protein 
Small cytokines (intecrine/chemokine) 

interleukin-8 like 
TIR domain . : ■ . 
TNF (tumor hecrdsis factor) family : 
Trefoil (P-type) domain - - ^ 1 : : : ; : 

BTK motif ''" '" ^^^ y ■^W-;>l^/^o C77>a.e5^;,a^ 
C2 domain 

Diacylglycerol kinase accessory domain (presumed! 
Diao^lglycerol kinase catalytic domain (presumed) 
Domain found in Dishevelled, Eel-10, and 
Pleckstrin (DEP) \ 
" ^^E?!nc finger . . . 
GDP dissociation inhibitor ^ ' ~ — -^ --^ 

G-protein alpha subunit . 
G-prdtein gamma like domains 
GTPase-activator protein for Ras-like GTPase 
Guanine nucleotide exchange factor for Ras-like 

GTPases; N-terminal motif 
Guanylate kinase 

Imniunoreceptor tyrosine-based activation motif * : : 
PH domain 

'^^^^^^ «*ers/diacylglycerol binding domain (CI 

PhosphatidylinositoUspedfic phospholipase C, X 
domain 

Phosphatidylinositol-speclfic phospholipase C Y 
domain 

Phosphotyrosine interaction domain (PTB/PID) 
PI3.kinase family. p85-blnding domain 
PI3-kinase family, ras-binding domain 
Putative CTP-ase activating protein for Arf 
Raf-like Ras-binding domain 
Rap/ran-GAP 

Ras association (RalGDS/AF-S) domain 
Ras family 
RasGEF domain 

-RegulatpVofC protein signaling domain 
Regulatory subunit of type II PKA R-subunit ' ^ 



^.^:?18(20)-5,&(.^-^^^^ 



0 
0 



BTK^ 
02 

DAGKa 
DAGKc 
DEP 

_ FYVE „ 

. G-alpha . 
G-gamma 
RasGAP 
RasGEFN 

Guanylate kin 
ITAM . 
PH : 

DAG_P£-bind 
PI-PLC-X 
PI-PLC-Y 
PID 

P13iep85B 
PI3ierbd 
ArfGAP 
RBD 

Rap-GAP 
RA 

Ras J 
RasGEF 
RGS 
Rita 



■^^r>V:;:c^5(6) 
-i 

:k;:381(930J: 
^ 7(9) 
1 

■ ■ .r 

v.- 

7" ■ 
2 V 

/:r .2,- 

4 
32 



.0 

'* 0: 
0 

-vr 0 > 

0 
0 



18 > 

":.,::^^t:i2::V:^ 
: ;-; ^5(6):?: 7 

5 

73(101) 
9 
10 
12(13) 

^ 28(30)_^. 
6 

27(30) , , . 
16 

11 V 
9 

12 
- 3 
193(212) 
45(56) 

12 

11 

24(27) 
2 
6 
16 
6(7) 
5 

18(19) 
126 

; 21 , 

>27 - 
4 



32(44) 
4 
8 
4 

.-~^.;;^14-.:. ,. 
Z 

. 10 
5 

- ' 5'*' 
2 

8 
0 

'72 {78) 
25(31) 



13 
1 
3 
9 

■ -4- 

■ 4 ' 
7(9) 

56(57); 

.^6(7). 



HV^i;0\O i 
Q ■ 
67(323) 
0 
0 
0 

■ T-'-^v^'O"; 
^:.-;.ro 

0 

.. 0 
0; 
0 
0 

0 
0 
0 
0 



2 

v' -2 
0 

24(35) 
7 
8 
10 

— 15,. 
1 

.20(23) 
5 
8 
3 

7 
0 

65(68) . 
26(40) 



11(12) 
1 
1 
8 
1 

' 2 
6 

' 51 
t 7 
12(13) 
2 



^' 0 

0 
0 

' 0 

70' 

0 

■ 0 
-0 
' 0 
0 

0 

0- 
0 



0 

- '0 
0 

.6(9) 
0 

5 

S 

1 
2 
1 
3 

1 

. .0 
24 
1(2) 

1 

1 

0 
0 
0 
6 
0 

0 ' 
1 

:..23A:- 
5 

1 ^ 



0 
0 

- 0; 

0 ; 

0 
0 

/. 0 ; 
0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 
0 
0 
0 



131(143) 
0 
0 

0 

66(90) 
6 

11(12) 
2 

- 15 
3 
5 
0 
0 
0 

4 
0 
23 
4 



8 

0 
0 
0 
15 
0 
0 
0 
78 
0 
0 
0 
Table 18 (Continued)



i Accession - j:.; ; 
'.'number - 



/Domain name 5/;?^^ 



H 



. >PF00620 - 
■ PF0062i. 

PF00536i^ 

-•:;:PF01369 

■i;^PF000i7 ; 

/;7,PF0iCj018r*^ 

. PF0i017 
■:. PF00790 ; 
ri^r PF0dS68 ^? 



RhoGAP 
RhoGEF 

^sm / 

Sec7 .' 
:SH2 

SH3 : V 

:.STAT . 

Whs . . 



PF06452 


>v Bcl-2 ■ - 


. ;PF0218p 


:\ BH4 ■ 


. PF00619 . 


CARD 


.PF00531 


s - Death > 


r PFdi335 


;';-:ded. : ^ 


: ' PF02179 


::;;.BAG , ■ 


: ' PF00656 


' ICE_p20 .. . 


PF00653 


BIR'-^l.: * 


PF00022 ' 


Actin " V 


PF00191 


• Annexin - . ■ * 


: . PF00402 


Calponin 


. PF00373 


Band„41 


PF00880 


Nebulin„repeat 


• PF00681 


Plectin_repeat 


PF00435 


. Spectrin 


PF00418 


Tubulin-binding 


PF00992 


Troponin 


PF02209. 


VHP 


PF01044 


Vinculin 


PF01391 


Collagen 


PF01413 


^C4^ /• ■ 


PF00431 


CUB 


PF00008 


EGF 


PF00147 


Fibrinogen_C 



PF00041 

PF007S7 

PF00357 

PF00362 

PF00052 

PF00053 

PF00054 

PF00055 

PF00059 

PF01463 

PF01462 

PF00057 

PF000S8 

PF00530 

PF00084 

PF00090 

PF00092 . 

PF00093 

PF00094 

PF06244 . 

PF00023 

PF00514 

PF00168 

PF00027 - 

PF01556 

PF00226 

PF00036 

PF00611 

PF01846 

PF00498 



! Fn3 

Furin-lilce 

Integrin^A 
...Integrin.B 
. , Laminin.B 

Laminin^EGF 

Larriinin^G 

Laminin_Nterm 

Lectin^c 

LRRCT 

LRRNT 

LdLrecept^a 

LdLrecept b 

SRCR 

Sushi 

Tsp_1 

Vwa 

Vwc 

ywd 

14-3-3 
Ank 

Armadillo seg 
C2 

cNMPlbinding 

DnaJ^C 

DnaJ 

Efhand** 

FCH 

FF 

FHA 



RhoGAP. domain " \\ [.'''. 
; y-: RhoGEF domain : ^ 

% : '/SAM domain (Sterile alpha motif) - ^ 
r \:':Sec7 domain -.yJ- ^: ^ \ >^ ■ ! ' . - ''^l 
vuk Src homology 2 (SH2j domain 
■ ^ ;Src homology 3 (SH3), domain ? ; ^^^V 

V' STAT^protein f^: X/ A: ^ ^ 

;^VHSjdomain^^^^vv:^t>g■ r ".' \ •.:V.: 

;-..;iWHi;dpmain>:\^ ^'f^ 'r ■ 'r^ 

' " . s V^^'*- iv-i \r ^ Domains involved in apoptosis . : 
Bcl-^te:^ >'-N;.!-^^f^*'r ' - ,: / ./ * . c ' ^- / • 

. BcIt2 homology region 4 ^ ^' V' - ^.: '^ ^''Ai^ 
::; Caspase recruitment domain ; 
^ Death'clomain.^ 'J ; / ; . . 

"r ■•■ Death effertor'dor^ih"^^^ y'^'^/'^Ti 
V Domain present in Hsp7b regulators - . - 
V JCE-likejprotease (dsp^^^ X-^^ 
Inhibitor of Apoptosis domain v - • ■ . '* ' ^r';.' 

■•Annexin"^'* ' ■ .J''" ■ 

Calponin family : ' , 

FERM domain (Band 4;1 family) 
• Nebulin repeat 

Ptectin repeat V'^ 
' Spectrin repeat ,. .A . ' ' ' ' ' ■ 

Tau and MAP proteirls. tubulin-binding 

Troponin^ _ 
- Villin' headpiece domain ;^ 
" . Vinculin family . .11 ...... 



^■..■^59 • 

oacL?S;29(3i): 
'U^x:v:^;;:87(95):; 

■ " ^;143(182) ' 

3 - • 
: -16 

;^4(5) :; 

?{5(8)^; 

8(14): 



61 (64) 
V16(55) 
-.13(22) 
29(30) 
4(148) 

;^2(ii): 

31 (195) 
4(12) 

.--4 
■ '5 

.4 ; 



: "r* • "..ru^.:.:::.:r s f CM ddhesion ' ^ - 
: : Collagen triple helbrrep<^ 

C-terminal tandem repeated domain in type 4 

procollagen , . 
CUB domain 

EGF-like domain ... . 

Fibrinogen beta and gamma chains. C-terminal 

globular domain 
Fibronectin type Mi domain 
Furin-like cysteine rich region ■ " 

Integrin alpha cytoplasmic region 
Integrins. beta chain , 
Laminin B (Domain IV) . 
Laminin EGF-like (Domains lll and V) 
Laminin C domain 
Laminin N-terminal (Domain VI) 
Lectin C-type domain 
Leucine rich repeat C-tenminal domain 
Leucine rich repeat N-terminal domain 
Low-density lipoprotein receptor domain class A 
Low-density lipoprotein receptor repeat class B 
Scavenger receptor cysteine-rich domain 
Sushi domain (SCR repeat) 
Thrombospondin type 1 domain 
von Willebrand factor type A domain 
von Willebrand factor type C domain 
von \yillebrand factqr type D. domain 

Protein interaction domains 

14-3-3 proteins/ 
Ank repeat ^\ : ' .. , 

Armadiito/beta^tenin-tike repeats 
C2 domain - j 
Cyclic nudeotide-binding domain 
DnaJ C terminal region 
DnaJ domain ' ' . > ^ -^ c 
EFhand ; . : . _ ' , 
Fes/CIP4 honK>logy domain 

FF domain 

FHA domain - ■ ^ ' - ' ' ^ 



. . =: .19 . 

::a23(24) % 

•;.\".:'-;T5 ■ 
''..5 ; 
:>^:;l33 (39) ^ - 

v:> \55(7s); , 

:c"^.;^:2:^.: 

2 . 

^'.:fv■^^,. 3 ' 
- . 5(9) : 
; 15(16).^ 

::-4{16) ^ 

17(19) 
r .1(2) 

13(171) r 

1(4) 

,/ ■ 6 

• --./,:-2 ■'• 

::^bAf.:2^-^ 



20' 
.^^8(19), 

;>;":8*^ 

^:44(48) 
; 46(61) , 
: 1(2) 

^2(3): 

.'■1 

' . '\ 1 . 

■:**2 

::2(^; 
12 

4(11) 
7(19) 
11(14) 
1 
0 

10(93) 
2(8) 
8 
2 
1 



':^65 (279)3 
-6(11) 



.10(46) 
: 2(4) 



47(69) 
108(420) 

. 26 

106(545) 
5 
3 
.8 
8(12) 
24(126) 
30 (57) 
10 

47(76) 
69 (81) 
40(44) 
35(127) 
15(96) 
: 11 (46) 
53(191) 
41 (66) 
34(58) 
19(28) 
15(35) 



9(47) 
45(186) 
: 10(11) 

.42 (168) : 

2 

..^ 

4(7)^ 
9(62) 
18(42) 
6 

23(24) 
23(30) 
7(13) 
33(152) 
9(56) 
4(8) 
11(42) 
11(23) 

0 . 
6(11) 

m 



'.' 20 


:,3'; 


f- , 145(404) 


72(269) 


. 22(56) 


11(38) 


73(101) 


32(44) 


26(31) 


21(33)" 


12 


•'-■-■■'9 


■ 44 ■ 


34 ' 


83(151) 


64(117) 


9 


3 


;v 4(11); 


. 4(10) 


"* 13 


. 15 



174(384) 
: 3(6) 

43(67) 
54(157) 
6 

. 34(156): 
1 
2 
2 

- 6(10) 
11(65) 
14(26) 
4 

91 (132) 
7(9) 
3(6) 
27(113) 
7(22) 
1(2) 
8(45) 
18(47) 
17(19) 
2(5) 

'9 ■ ■ 

-3 . 
75(223) 
3(11) 
24(35) 
1 15(20) 
5 
33 
41 (86) 

3(16) 
7 



.9 

^;:^:^./.--3.; 

^•-^'>-5^ 
^23^27)^ 

*^:(b^?l' ■ 

^■■■'■,;-o • 
J -6 

, V 0 : 
;i(2) 

•9(11) 

.0 
0 
0' 
' 0 
^ 0 
■0 
0 

^0 

'yO 



: 8 

;.r>^>.^>^:^'- 0 
■-';^;:n7;-6 
---r" -9. 

4 

6 

:~2i0 - 

^"■'r,Vo 
^i^5: . 



^--24, 
.6(16)^ 
0 
0 

: 0 

-0 

0 
1

0
0
0
Table 18 (Continued)

myelin proteins result in severe demyelination, which is a pathological
condition in which the myelin is lost and the nerve conduction is
severely impaired (130). Humans have at least 10 genes belonging to four
different families involved in myelin production (five myelin P0, three
myelin proteolipid, myelin basic protein, and myelin-oligodendrocyte
glycoprotein, or MOG), and possibly more remotely related members
ilS^i^^^SJ^^>^^l^i^ myelin 



Intercellular and intracellular signaling
pathways in development and homeostasis.

V^.Many prdteiuvfanulies^t^^^^ 
.r^i^iuni^Telatiya^ are 
i;vvyolved:in>ignaiing^ 

.v^response; to;;5}eYeiopment ;khd aiifferentiation 



^y; Accession ' 
-ngmber : 

PF0b254 
PF01590 
:: PF01344 
• ^ PF00560 
;.PF00917 
PF00989 
PF00595 
PF00169 
. PF01535 
■PF00536 :. 
PF01369 
PF00017 
PF00018 
PF01740 
PF00515 
PF00400 
PF00397 
PF0b569 

PF01754 
PF01388 
PF6i426 
PF00643 
PF00533 
PF00439 
PF00651 : 
PF00145 . 
PF00385 



■Doifiain name 



PF0012S 
PF00134 
PF00270 
PF01529 
PF00646 
PF00250 

^PF0032<f^ 
PF01585 
PF00010 
PF00850 
PFd0046 
PF01833 
PF02373 
PF02375 
PF00013 
PF01352 

PF00104 

PF00412 
PF00917 
PF00249 
PF02344 
PF01753 
PF00628 
PF00157 
PF02257 
PF00076 

PF02037 ' 
PF00622 ^ ' 
PF01852 
PF00907 



-rFKBP 

:(CAFS-.-: 

^MATH .. ' 

: pas - 

PDZ . 

PH " 

:.: JaM 

" Sec7 r ^ : 

SH2 ^" V 
' SH3; v 

STAS 

TPR**" 

WD40** 

vAv ; 

2Z 

Zf-A20 
ARID : 
BAH 

Zf-B_box** 
BRCT : , . 
Bromodomain 
BTB 

DNA^methylase 
Chromo 



Histone 
Cyclin 
DEAD 
Zf-DHHC 
F-box** 
, Fortehead 
GATS"-^ '^ 
G-patch 
HLH** , 
Hist_deacetyl 
Homeobox 
TIG 
JmjC 
JmjN 

XH-domaIn 
KRAB 

Hormone_rec 

LIM 
MATH 

Myb.DNA-blnding 

Myc-LZ 

Zf-MYND 

PHD 

Pou 

RFX^DNA^binding 

Rrm . . 

SAP- ' ■ 

SPRY J;; 

START 
T-box 



. ^pomafn descripti \ 



FKBP:1^pe pepti<^l-*prblyl ds-trans Jsornerases 
.._CAFdomain /■'^)^:f:^:y-r:-'' c: .yi-i , 
>^.-'-Kelch motif ,,4-;./:;:.^;f^rr;-\:^ y; 
-/ Leucine' Rich Repeat " '^\f']'^^^ ' " ^ry^ ^. - ■ • 
V MATH domain r v ■ . ^ ■"■ 
PAS domain "A^. ^t^^-^^^-'^iy-^--^^ 

PPZ domain (Ako foiown as DHR or CLGF) - 
.{PH domain s v : : : ' . ■ 

PPRrepeat--; ^ . \": 

SAM domain (Sterile 'alpha motif) 

Sec7 domain ^v-X^ ;^^ ' 
:: Src homology 2 (SH2) domain 
. YSrc homology 3 (SH3) domain • 

STAS domain A'-j ^ ' : 

TPR domain : ■ . ^ . \ i^w. . 
, WD40 domain '■■■>4t.lV^ 

w\y domain :-r:\ " . .; ■ - ; ^ 

ZZ-Zinc finger present in dystrophin, CBP/p300 

. ^ ^ ' : : t ■ * Nuclear interaction domains 
A20-like zinc finger 

ARID DNA binding domain 
BAH domain - ■ ' 

B-box zinc finger . - . : . 

BRCA1 C Terminus (BRCT) domain 
Bromodomaia - 
BTB/POZdomain ; v ; - ; , 
C-S Qrtosine-specific DNA methylase v^v-Vv^ kv-^ 
chromo' (CHRromatin Organization Modifier) -" ^ 
domain " ^'^^ : " t j-^-"-' 

Core histone H2A/H2B/H3/H4 
Cydin. . 

DEAD/DEAH box helicase 
DHHC zinc finger domain 
F-box domain 

Fork head^omain^^^ . .1 . . 

CATA zinc finger " ' ~ ' 

G-patch domain 

^ Helix-loop-helix DNA-binding domain 
Histone deacetylase family 
Homeobox domain : . , 
IPT/TIG domain 
JmjC domain 
JmjN domain 
KH domain . 
KRAB box 

Ligand-binding domain of nuclear hormone 

receptor 
UM domain containing proteins 
MATHdomaIn \ ' ... 
Myb-like DNA-binding domain 
Myc leucine zipper donriain 
MYND finger 

PHD-finger v y ~ '■" 
Pou domain— N-tenmlnal to homeobox domain 
RFX DNA-binding domain 
RNA recognition motif (a.k-a. RRM, RBD, or RNP 
domain) 
(Tables 18 and 19). They include secreted hormones and growth factors,
receptors, intracellular signaling molecules, and transcription factors.

Developmental signaling molecules that are enriched in the human genome
include growth factors such as wnt, transforming growth factor-β
(TGF-β), fibroblast growth factor (FGF), nerve growth factor, platelet
derived growth factor (PDGF), and ephrins. These growth factors affect
tissue differentiation and a wide range of cellular processes involving
actin-cytoskeletal and nuclear regulation. The corresponding receptors of
these developmental ligands are also expanded in humans. For example,
our analysis suggests at least 8 human ephrin genes (2 in the fly, 4 in
the worm) and 12 ephrin receptors (2 in the fly, 1 in the worm). In the
wnt signaling pathway, we find 18 wnt family genes (6 in the fly, 5 in
the worm) and 12 frizzled receptors (6 in the fly, 5 in the worm). The
Groucho family of transcriptional corepressors downstream in the wnt
pathway is even more markedly expanded, with 13 predicted members in
humans (2 in the fly, 1 in the worm).
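
The family counts quoted above translate directly into fold differences between genomes. The short Python sketch below only redoes that arithmetic; the dictionary layout and the name `expansion_table` are illustrative assumptions, and the human ephrin count is a lower bound ("at least 8"), so its ratios are lower bounds as well.

```python
from typing import Dict, Tuple

# (human, fly, worm) gene counts as quoted in the text.
FAMILY_COUNTS: Dict[str, Tuple[int, int, int]] = {
    "ephrin": (8, 2, 4),           # human value is "at least 8"
    "ephrin receptor": (12, 2, 1),
    "wnt": (18, 6, 5),
    "frizzled": (12, 6, 5),
    "Groucho": (13, 2, 1),
}


def expansion_table(counts: Dict[str, Tuple[int, int, int]]
                    ) -> Dict[str, Tuple[float, float]]:
    """Return (human/fly, human/worm) ratios for each family."""
    return {name: (human / fly, human / worm)
            for name, (human, fly, worm) in counts.items()}


if __name__ == "__main__":
    for name, (vs_fly, vs_worm) in expansion_table(FAMILY_COUNTS).items():
        print(f"{name}: {vs_fly:.1f}x vs fly, {vs_worm:.1f}x vs worm")
```

On these numbers the Groucho family shows the largest ratios of the five, in line with the statement that it is even more markedly expanded than the wnt ligands and frizzled receptors.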

Extracellular adhesion molecules involved in signaling are expanded in
the human genome (Tables 18 and 19). The interactions of several of
these adhesion domains with extracellular matrix proteoglycans play a
critical role in host defense, morphogenesis, and tissue repair (131).
Consistent with the well-defined role of heparan sulfate proteoglycans
in modulating these interactions (132), we observe an expansion of the
heparin sulfate sulfotransferases in the human genome relative to worm
and fly. These sulfotransferases modulate tissue differentiation (133).
A similar expansion in humans is noted in structural proteins that
constitute the actin-cytoskeletal architecture. Compared with the fly
and worm, we observe an explosive expansion of the nebulin (35 domains
per protein on average), aggrecan (12 domains per protein on average),
and plectin (5 domains per protein on average) repeats in humans. These
repeats are present in proteins involved in modulating the
actin-cytoskeleton with predominant expression in neuronal, muscle, and
vascular tissues.



i^^^'Hiparison across ithe^^ eu 
^i:.ka3y.otid*:6^^^ 

rv^ed protein fanulies and domains :inyoived in; 
v^^f cytoplasmic signal ^arisductiph^^ sy ] 

In particular, signal transduction pathways playing roles in
developmental regulation and acquired immunity were substantially
enriched. There is a factor of 2 or greater expansion in humans in the
Ras superfamily GTPases and the GTPase activator and GTP exchange
factors associated with them. Although there are about the same number
of tyrosine kinases in the human and C. elegans genomes, in humans there
is an increase in the SH2, PTB, and ITAM domains involved in
phosphotyrosine signal transduction. Further, there is a twofold
expansion of phosphodiesterases in the human genome compared with either
the worm or fly genomes.
The downstream effectors of the intracellular signaling molecules include
the transcription factors that transduce developmental fates.
Significant expansions are noted in the ligand-binding nuclear hormone
receptor class of transcription factors compared with the fly genome,
although not to the extent observed in the worm (Tables 18 and 19).
Perhaps the most striking expansion in humans is in the C2H2 zinc finger
transcription factors. Pfam detects a total of 4500 C2H2 zinc finger
domains in 564 human proteins, compared with 771 in 234 fly proteins.
This means that there has been a dramatic expansion not only in the
number of C2H2 transcription factors, but also in the number of these
DNA-binding motifs per transcription factor (8 on average in humans, 3.3
on average in the fly, and 2.3 on average in the worm). Furthermore,
many of these transcription factors contain either the KRAB or SCAN
domains, which are not found in the fly or worm genomes. These domains
are involved in the oligomerization of transcription factors and
increase the combinatorial partnering of these factors. In general, most
of the transcription factor domains are shared between the three animal
genomes, but the reassortment of these domains results in
organism-specific transcription factor families. The domain combinations
found in the human, fly, and worm include the BTB with C2H2 in the fly
and humans, and



^:^^lhQmep^c^i^I^ -v^^ / 

'^S ^ou;:=,anilLIM|d^ '•■^ 
^; A ge^omes^fc^lante^ Qf\ 

/^I^^P^^ VPl ■ 

v^Pand^"AP2j^dcraa^^ QS^y 
:-V;|T^eye^;geii6^^^ 
'ifactors! rornpaied ;5wtii .iihe' ^rniilidceUd^^ eu^ . 
^ karyotes,-- and. its repertoire: is to:the : 

expansion of the yeast-specific C6 transcription
factor family involved in metabolic regulation.
^.S' /AVWle we have^^^^ expansions in a . 

subset.of signal transduction molecules in the 
human genome compared with the other eu- . 
^v karyotic :genomes,v,itvish^ that 

worms and humans have approximately the
same number of both tyrosine kinases and
serine/threonine kinases (Table 19). It is
important to note, however, that these are merely
counts of the catalytic domain; the proteins
that contain these domains also display a
wide repertoire of interaction domains with
significant combinatorial diversity.
Hemostasis. Hemostasis is regulated
primarily by plasma proteases of the coagulation
pathway and by the interactions that occur
between the vascular endothelium and platelets.
Consistent with known anatomical and
physiological differences between vertebrates and
invertebrates, extracellular adhesion domains that
constitute proteins integral to hemostasis are
expanded in the human relative to the fly and
worm (Tables 18 and 19). We note the
evolution of domains such as FIMAC, FN1, FN2,
and C1q that mediate surface interactions
between hematopoietic cells and the vascular
matrix. In addition, there has been extensive
recruitment of more-ancient animal-specific
domains such as VWA, VWC, VWD, kringle,
and FN3 into multidomain proteins that are
involved in hemostatic regulation. Although we
do not find a large expansion in the total
number of serine proteases, this enzymatic domain
has been specifically recruited into several of
these multidomain proteins for proteolytic
regulation in the vascular compartment. These are
represented in plasma proteins that belong to
the kinin and complement pathways. There is a
significant expansion in two families of matrix
metalloproteases: ADAM (a disintegrin and
metalloprotease) and MMPs (matrix
metalloproteases) (Table 19). Proteolysis of extracel-

,vMar.jnatrix(ECM)pro^^^^^^ 

' - eases such as cancer; artMtis. Alzheinier's dis: 

:, iil35,J36). ADAMs are a family bf integral 
7 P??^?i9s:^yift a piyotd TOk 

^:^genolysis^^d .mbiulating^iintei^ ;be,. 
;/:tweeii .h«niatopoietic:^t(^ 

vascular matrix components. These proteins
have been shown to cleave matrix proteins
and even signaling molecules: ADAM-17
converts tumor necrosis factor-α, and
ADAM-10 has been implicated in the Notch
signaling pathway (135). We have identified
19 members of the matrix metalloprotease

^^^^^ °f 
J ADAM and ADAM-TS faSilies! 'K ' '^i'^^^:^ 

Apoptosis. Evolutionary conservation of
some of the apoptotic pathway components
across eukarya is consistent with its central
role in developmental regulation and as a
response to pathogens and stress signals. The
signal transduction pathways involved in
programmed cell death, or apoptosis, are
mediated by interactions between well-characterized
domains that include extracellular
domains, adaptor (protein-protein interaction)
domains, and those found in effector and
regulatory enzymes (137). We enumerated
the protein counts of central adaptor and
effector enzyme domains that are found only in
the apoptotic pathways to provide an estimate
of divergence across eukarya and relative
expansion in the human genome when
compared with the fly and worm (Table 18).
Adaptor domains found in proteins restricted
only to apoptotic regulation such as the DED
domains are vertebrate-specific, whereas
others like BIR, CARD, and Bcl2 are represented
in the fly and worm (although the number
of Bcl2 family members in humans is
significantly expanded). Although plants and yeast
lack the caspases, caspase-like molecules,
namely the para- and meta-caspases, have
been reported in these organisms (138).
Compared with other animal genomes, the human
genome shows an expansion in the adaptor
and effector domain-containing proteins
involved in apoptosis, as well as in the
proteases involved in the cascade such as the
caspase and calpain families.

Expansions of other protein families
Metabolic enzymes. There are fewer
cytochrome P450 genes in humans than in either
the fly or worm. Lipoxygenases (six in
humans), on the other hand, appear to be specific
to the vertebrates and plants, whereas the
lipoxygenase-activating proteins (four in humans)
may be vertebrate-specific. Lipoxygenases are
involved in arachidonic acid metabolism, and
they and their activators have been implicated




in diverse human pathology ranging from
allergic responses to cancers. One of the most
, ,^^ri?mg Aumaii wqjansip^^ in 

^V;?^leh3«h<)genase?(GAPDH)-^^^ 
,y^;mMs. 3,in the.i]y;and4 ii thb worm)' .There' 



m 



.vposed GAPDH.pseudogenes (139), which 

expansion. 

However, iitiis^^^^^^ 'GAPDH long 
ji/.ta<))yn;as^ invoked- in 
Ai.1basic.^netabp«smfqundacri)sS^ 
;,';bactenatoliupiaM;'haVxe^^^^^ 
•%<S*^y«o*«';-fiinctions.^^^^^^^ 



•rrrTAmfloride^sensltlve/degehe^^ - t^'Se — 



^Acetylcholine receptor v : -^'^ 
rTAmHorWe-sensitive/degene^^^^ 

^ Neurotram 
T:;-^ ?2X punnoceptdr' ; 

X:4:Tnahsierit recejSor^ - ^^-^'^ 
Voltage-gated Ca2+ alpha
Voltage-gated Ca2+ alpha-2
Voltage-gated Ca2+ beta
Voltage-gated Ca2+ gamma
Voltage-gated K+ alpha
Voltage-gated KQT
Voltage-gated Na+
Myelin basic protein
Myelin P0
Myelin proteolipid
Myelin-oligodendrocyte glycoprotein
Neuropilin
Plexin
Semaphorin
Synaptotagmin



33 

6 
11 



Defensin
Cytokine†
GCSF
GMCSF
Intercrine alpha
Intercrine beta
Interferon
Interleukin
Leukemia inhibitory factor
MCSF
Peptidoglycan recognition protein
Pre-B cell enhancing factor
Small inducible cytokine A
SI cytokine

Cytokine receptor†
Bradykinin/C-C chemokine receptor
Fl cytokine receptor
Interferon receptor
Interleukin receptor
Leukocyte tyrosine kinase receptor
MCSF receptor
TNF receptor
Immunoglobulin receptor†
T-cell receptor alpha chain
T-cell receptor beta chain
T-cell receptor gamma chain
T-cell receptor delta chain
Immunoglobulin FC receptor
Killer cell receptor
Polymeric-immunoglobulin receptor
tributes to gene regulation at either the
splicing or translational level.
Posttranslational modifications. In this
set of processes, the most prominent
expansion is the transglutaminases,
calcium-dependent enzymes that catalyze the cross-linking
of proteins in cellular processes such as
hemostasis and apoptosis (147). The vitamin
K-dependent gamma carboxylase gene
product acts on the GLA domain (missing in the
fly and worm) found in coagulation factors,
osteocalcin, and matrix GLA protein (148).
Tyrosylprotein sulfotransferases participate
in the posttranslational modification of
proteins involved in inflammation and
hemostasis, including coagulation factors and
chemokine receptors (149). Although there is no
significant numerical increase in the counts
for domains involved in nuclear protein
modification, there are a number of domain
arrangements in the predicted human proteins
that are not found in the other currently
sequenced genomes. These include the tandem
association of two histone deacetylase
domains in HD6 with a ubiquitin finger domain,
a feature lacking in the fly genome. An
additional example is the co-occurrence of the
important nuclear regulatory enzyme PARP
(poly-ADP ribosyl transferase) domain fused
to the protein-interaction domains BRCT and
VWA in humans.

Concluding remarks. There are several
possible explanations for the differences in
phenotypic complexity observed in humans
when compared to the fly and worm. Some of
these relate to the prominent differences in
the immune system, hemostasis, neuronal,
vascular, and cytoskeletal complexity. The
finding that the human genome contains
fewer genes than previously predicted might be
compensated for by combinatorial diversity
generated at the levels of protein architecture,
transcriptional and translational control,
posttranslational modification of proteins, or
posttranscriptional regulation. Extensive
domain shuffling to increase or alter
combinatorial diversity can provide an exponential
increase in the ability to mediate
protein-protein interactions without dramatically
increasing the absolute size of the protein
complement (150). Evolution of apparently new
(from the perspective of sequence analysis)
protein domains and increasing regulatory
complexity by domain accretion both
quantitatively and qualitatively

CREB



' • ' ' * • " " -V^ - V"^ 4 cfiu IS, me t;/H2 >- ■ ;> *j roucno 
izmc fmger-TCbntaming itranscHption' J^^^^^^ Histdn^ HT 
where -we :see'ex^ 

;domains ^ per - protein. ■ triof^f^r^^i^^^r^^'^ X^Histone H2B 



v^domams per-protein/ tbgetKer: wi^^^ 
.;,brate-specific donaains vsuch^^^^^ 'iid . 

SCAN. Recent reports on the prominent use
of internal ribosomal entry sites in the human
genome to regulate translation of specific
classes of proteins suggest that this is an area
that needs further research to identify the full
extent of this process in the human genome
(151). At the posttranslational level, although
we provide examples of expansions of some
protein families involved in these
modifications, further experimental evidence is
required to evaluate whether this is correlated
with increased complexity in protein
processing. Posttranscriptional processing and the
extent of isoform generation in the human
remain to be cataloged in their entirety. Given
the conserved nature of the spliceosomal
machinery, further analysis will be required to
dissect regulation at this level.

8 Conclusions

8.1 The whole-genome sequencing
approach versus BAC by BAC

Experience in applying the whole-genome
shotgun sequencing approach to a diverse
group of organisms with a wide range of



Histone H3
Histone H4
Homeotic†
Abd-B
Bithoraxoid
Iroquois class
Distalless
Engrailed
LIM-containing
MEIS/KNOX class
NK-3/NK-2 class
Paired box

Leucine zipper
Nuclear hormone receptor†
Pou-related
Runt-related



" . " — ** ^^"^^ ioiigc oi Bcl-2 

genome sizes and repeat content allows us to



Cadherin ^: : 
Claudin 

Complement receptor-related 
Connexin 

Galectin
Glypican
ICAM
Integrin alpha
Integrin beta
LDL receptor family
Proteoglycans

Bcl-2



assess its strengths and weaknesses. With the
success of the method for a large number of
microbial genomes, Drosophila, and now the
human, there can be no doubt concerning the
utility of this method. The large number of
microbial genomes that have been sequenced
by this method (15, 152) demonstrate that
megabase-sized genomes can be sequenced
efficiently without any input other than the de
novo mate-paired sequences. With more
complex genomes like those of Drosophila or
human, map information, in the form of
well-ordered markers, has been critical for
long-range ordering of scaffolds. For joining
scaffolds into chromosomes, the quality of the
map (in terms of the order of the markers) is
more important than the number of markers
per se. Although this mapping could have
been performed concurrently with
sequencing, the prior existence of mapping data
Caspase
22
4
13

Hemostasis
10
19
ADAM/ADAMTS
Fibronectin
Globin

Matrix metalloprotease
Serum amyloid A
Serum amyloid P (subfamily of Pentaxin)

Serum paraoxonase/arylesterase
Serum albumin
Transglutaminase

Other enzymes
60
46
Splicing and translation
4
4
10



0 
7 

9 
0 
2 
2 
0 
0 

0 
0 



Cytochrome p450 
GAPDH

Heparan sulfotransferase 



1 

89 
3 
4 



0 

n 

0 

3 

12 
0 
3 
7 
0 
0 

3 
0 

. 0 

83 
4 
2 



.-0 
3 
1 
0 



0 
0 
3 
3 
0 
0 

0 
0 
0 

256 
8 
0 



10 
104 
80 



60 
117 



13 

265 
256 



- ^ o^uuciit^- r 1 269 -135 

ing, the pnor existence' of mappiiig &ta ^"i- P«>telnit__ ^ . 812 m 
BAC ..ones ■ IlTS^-.^^-^fS^SiSt'S^"^^^^^ 



www.sciencemag.org SCIENCE VOL 291 16 FEBRUARY 2001 1345






^ .quencejdl mto^enl^omenc .regions and.aU^predictmg genesjshouldlumt^s nu^ 
Allowed high-quality res.q^^^^ 

BAC phys,^ map was most useful m.re-iM^mRNAiia^specifiacell^ 
^'.rgions near the highly, repetitive centromeres.. the presence of a-gene. - {-^O- -Ki^intemal^ribisbma ent;^^^^^ 
' and telomeres.:,WGA-.has been fo{md to;de-, - i t,-,.-_^.._„'..,...:, . " ^0;.!°5?™^A ™?sp™al.entt^^^^ found 

:; liver. exceUent-q 
ii unique regions < 

., size, Mdniorei_^ ^^^^^ .^^^ 

.teJ.™es.fte;Ji?M^ 
.,lessofthe^epetit.ve,j«quen 
The cost and overaUeffic,^^^ 

cloneap^ei^makesti^mmtjo p^emise^and.on,the^,asis of v3:a<^ti^dty^VWch .siiggesfethi^^ 

. as. a stand^one strBtegy..fbr,foture .large-scal^ 
:genome-se(piencmg projecte Spec^^ 
,ftonspfBAg^.oi^erdqnen^ingand;;.^^^ 

• s^encing stipes to lesdvp^igmh^ m^iispome^^quld cbnteina maxirmim of not much fSivgenome ^ S^a^btiidaily -^^^^ i:^:- 
. sequence assembly ^bat ^ot ;beieffid^^ 

r^lved^yat^ J^,. 

, are .^y worth explomig. Hybnd approaches Vy^at:by,Crov^and,Khnuia: (i55i^ Une^iaUyvi hid be^i^redicte^fSS^ 

to whole-g^ome sequencmg wdl only woik.if ..^^^^^ 

there is sufficient coverage m both fte whole-,:.v;compared.tdrn,OOOaerived by;ann^ 

genome sho^ phase.and the BAC^lone se-,.-.the fly.genonie (26^27)^nieseaigmnents'lbr;;^^^^ 

quencmg phase Our «q,enenc^ with human : .. the theoretical maximum gene nmnber were . v and are . the most- genenle^^ but 
genome assembly suggeste thatlhi^ 

at least 3X coverage of both whole-genome and
BAC shotgun sequence data.
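
The coverage requirement above can be made concrete with a small calculation. The sketch below uses the idealized Lander-Waterman (Poisson) model, which is an assumption introduced here for illustration and is not the assembly procedure described in this paper: if reads land uniformly at random, a given base is missed by every read with probability e^(-c) at c-fold coverage. The function name is hypothetical.

```python
import math


def expected_fraction_covered(coverage):
    """Expected fraction of bases hit by at least one read.

    Assumes reads are placed uniformly at random (idealized Poisson model),
    so the chance that a base is missed by all reads is exp(-coverage).
    """
    if coverage < 0:
        raise ValueError("coverage must be non-negative")
    return 1.0 - math.exp(-coverage)


if __name__ == "__main__":
    # Compare a single 3X data set with two 3X data sets pooled (6X total).
    for c in (1.0, 3.0, 6.0):
        print(f"{c:.0f}X coverage: {expected_fraction_covered(c):.4f} of bases sampled")
```

Under this idealized model, 3X coverage in one data set alone leaves roughly 5% of bases unsampled; real genomes depart from the model because of repeats and cloning bias.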

8.2 The low gene number in humans

We have sequenced and assembled ~95% of
the euchromatic sequence of H. sapiens and
used a new automated gene prediction
method to produce a preliminary catalog of the
human genes. This has provided a major
surprise: We have found far fewer genes (26,000
to 38,000) than the earlier molecular
predictions (50,000 to over 140,000). Whatever
the reasons for this current disparity, only
detailed annotation, comparative genomics
(particularly using the Mus musculus
genome), and careful molecular dissection of
complex phenotypes will clarify this critical
issue of the basic "parts list" of our genome.
Certainly, the analysis is still incomplete and
considerable refinement will occur in the
years to come as the precise structure of each
transcription unit is evaluated. A good place
to start is to determine why the gene
estimates derived from EST data are so
discordant with our predictions. It is likely that the
following contribute to an inflated gene
number derived from ESTs: the variable lengths
of 3'- and 5'-untranslated leaders and trailers;
the little-understood vagaries of RNA
processing that often leave intronic regions in an
unspliced condition; the finding that nearly
40% of human genes are alternatively spliced
(153); and finally, the unsolved technical
problems in EST library construction where
contamination from heterogeneous nuclear
RNA and genomic DNA are not uncommon.
Of course, it is possible that there are genes
that remain unpredicted owing to the absence
of EST or protein data to support them,
although our use of mouse genome data for



thataUvgeries/have a.certam b of predicted -40%. The low GfCL isochores 



mutation to a deleterious state. However, it is
clear that many mouse, fly, worm, and yeast
knockout mutations lead to almost no
discernible phenotypic perturbations



make up 65% of the genome, and 48% of the
genes. This inhomogeneity, the net result of
- millions of years of mammalian gene:dupli-. . 
{ cation, has ..been .described as; t^ 



:The.;^odest..^iiumber-^f^^^ 



; means^thatj we . must ;l6bk . elsewhef eVfor the.i. 
:{;mechanisms\,that g^ 

inherent in human development and the
sophisticated signaling systems that maintain
homeostasis. There are a large number of
ways in which the functions of individual
genes and gene products are regulated. The
degree of "openness" of chromatin structure



;;:yarc;there ^ jow-'^' 
gene density, and are these accidents of
history or driven by selection and evolution? If
these deserts are dispensable, it ought to be
possible to find mammalian genomes that are
far smaller in size than the human genome.
. I^^^^ have genome 

^ sizes ..that are much ;Srnaiier than that of hu- 



.v,and hence.transcriptional activity.is regulated ;r:v>mans;, f^^^^^^^ a species of 

.>. by : protem .'complexes .thatv involve - histone^-fM ;a genome* size that is only 



and DNA enzymatic modifications. We
enumerate many of the proteins that are likely
involved in nuclear regulation in Table 19.
The location, timing, and quantity of
transcription are intimately linked to nuclear
signal transduction events as well as by the
tissue-specific expression of many of these
proteins. Equally important are regulatory
DNA elements that include insulators,
repeats, and endogenous viruses (157);
methylation of CpG islands in imprinting (158);
and promoter-enhancer and intronic regions
that modulate transcription. The spliceosomal
machinery consists of multisubunit proteins
(Table 19) as well as structural and catalytic
RNA elements (159) that regulate transcript
structure through alternative start and
termination sites and splicing. Hence, there is a
need to study different classes of RNA
molecules (160) such as small nucleolar RNAs,
antisense riboregulator RNA, RNA involved
in X-dosage compensation, and other
structural RNAs to appreciate their precise role in
regulating gene expression. The phenomenon



50% that of humans (164). Similarly, Muntiacus,
a species of Asian barking deer, has a
genome size that is ~70% that of humans.

8.3 Human DNA sequence variation
and its distribution across the genome

This is the first eukaryotic genome in which a
nearly uniform ascertainment of polymorphism
has been completed. Although we have
identified and mapped more than 3 million SNPs, this
by no means implies that the task of finding and
cataloging SNPs is complete. These represent
only a fraction of the SNPs present in the
human population as a whole. Nevertheless,
this first glimpse at genome-wide variation has
revealed strong inhomogeneities in the
distribution of SNPs across the genome. Polymorphism
in DNA carries with it a snapshot of the past
operation of population genetic forces,
including mutation, migration, selection, and genetic
drift. The availability of a dense array of SNPs
will allow questions related to each of these
factors to be addressed on a genome-wide basis.
SNP studies can establish the range of
haplotypes present in subjects of different
ethnogeographic origins, providing insights into
population history and migration patterns. Although
such studies have suggested that modern human
lineages derive from Africa,
questions regarding human origins remain
unanswered, and more analyses using
SNP maps will be needed to settle these
controversies. In addition to providing
population;.!^^ 
v^ture:, Sl^^,^;:^^ 

' ' . g^es.' The correlation betw^^ patterns of in- 
; ■ V traspecies ; ahc^ internee 
i^^i^may 'prove to be^^ 
• , ' .tify sites, of reduced genetic di'^^^^ 

V mark Jdci where sequence not 
■■■■■ tolerated'. V ; 

The remarkable heterogeneity in SNP
density implies that there are a variety of
forces acting on polymorphism; sparse
regions may have lower SNP density because
the mutation rate is lower, because most of
those regions have a lower fraction of
mutations that are tolerated, or because recent
strong selection in favor of a newly arisen
allele "swept" the linked variation out of the
population (165). The effect of random
genetic drift also varies widely across the
genome. The nonrecombining portion of the Y
chromosome faces the strongest pressure
from random drift because there are roughly
one-quarter as many Y chromosomes in the
population as there are autosomal
chromosomes, and the level of polymorphism on the
Y is correspondingly less. Similarly, the X
chromosome has a smaller effective
population size than the autosomes, and its
nucleotide diversity is also reduced. But even
across a single autosome, the effective
population size can vary because the density of
deleterious mutations may vary. Regions of
high density of deleterious mutations will
see a greater rate of elimination by
selection, and the effective population size will
be smaller (166). As a result, the density of
even completely neutral SNPs will be lower
in such regions. There is a large literature
on the association between SNP density
and local recombination rates in
Drosophila, and it remains an important task to
assess the strength of this association in the
human genome, because of its impact on
the design of local SNP densities for
disease-association studies. It also remains an
important task to validate SNPs on a
genomic scale in order to assess the degree
of heterogeneity among geographic and
ethnic populations.
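
The chromosome-count argument above can be checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes a randomly mating population with equal numbers of males and females and treats the number of chromosome copies as a proxy for effective population size. The function names and the population size are hypothetical and do not come from the paper.

```python
from fractions import Fraction


def chromosome_copies(n_males, n_females):
    """Count the copies of each chromosome class carried by a population."""
    return {
        "autosome": 2 * (n_males + n_females),  # two copies in every individual
        "X": n_males + 2 * n_females,           # one copy in males, two in females
        "Y": n_males,                           # one copy in males only
    }


def relative_copy_number(n_males, n_females):
    """Express each class as a fraction of the autosomal copy number."""
    copies = chromosome_copies(n_males, n_females)
    return {name: Fraction(count, copies["autosome"]) for name, count in copies.items()}


if __name__ == "__main__":
    # Hypothetical population of 5000 males and 5000 females.
    for name, ratio in relative_copy_number(5000, 5000).items():
        print(f"{name}: {ratio} of the autosomal copy number")
```

With an equal sex ratio the Y is present at one-quarter and the X at three-quarters of the autosomal copy number, which is the one-quarter figure quoted in the text; under a simple neutral model, expected diversity would scale with these ratios.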




, then docks on this, and then the. complex 8-5 Beyond single components 

W S!^ V* - ' i}^'^ Whil^ would disag^eemth the intuitive 

,^^l^human.(hseases:.^m^^ 
Aj^ean^^^ 

;^^>^rwc^d.jhe^ is thereay ^^^deriying nhei^ S^^^^^gc^-^P^ 

'^'S^!^ "^^"^^ ^ SHc^eye?,lwe hivlfe^^ ; 

,,,,ilar.neuroan^ 
,,<^hlm^^ee,.the^►rm.yol^lme^^^^^ 

orders of magnitude less than that of a chimp
and three orders less than that of humans. Yet
the neuroanatomies of all three brains are strik-



; :v:ingly. similar,, and t^^^ 

of the pygmy marmoset are little different from
those of chimpanzees. Between humans and
chimpanzees, the gene number, gene structures
and functions, chromosomal and genomic
organizations, and cell types and neuroanatomies
are almost indistinguishable, yet the
developmental modifications that predisposed human
lineages to cortical expansion and development
of the larynx, giving rise to language,
culminated in a massive singularity that by even the
simplest of criteria made humans more
complex in a behavioral sense.



can self-organize, but more important, they can
be particularly robust. This robustness is not
due to redundancy, but is a property of
inhomogeneously wired networks. The error
tolerance of such networks comes with a price; they
are vulnerable to the selection or removal of a
few nodes that contribute disproportionately to
network stability. Gene knockouts provide an
illustration. Some knockouts have minor
effects, whereas others have catastrophic effects
on the system. In the case of vimentin, a
supposedly critical component of the cytoplasmic
intermediate filament network of mammals, the
knockout of the gene in mice reveals them to be
reproductively normal, with no obvious



8.4 Genome complexity 

We will soon be in a position to move away
from the cataloging of individual
components of the system, and beyond the
simplistic notions of "this binds to that, which


Simple examination of the number of
neurons, cell types, or genes or of the genome
size does not alone account for the
differences in complexity that we observe. Rather, it is
the interactions within and among these sets
that result in such great variation. In addition,
it is possible that there are "special cases" of
regulatory gene networks that have a
disproportionate effect on the overall system. We
have presented several examples of
"regulatory genes" that are significantly increased in
the human genome compared with the fly and
worm. These include extracellular ligands
and their cognate receptors (e.g., wnt,
frizzled, TGF-β, ephrins, and connexins), as well
as nuclear regulators (e.g., the KRAB and
homeodomain transcription factor families),
where a few proteins control broad
developmental processes. The answers to these
"complexities" perhaps lie in these expanded
gene families and differences in the
regulatory control of ancient genes, proteins,
pathways, and cells.



" r-^z.-'^rt^ yet^the^^^ 



uous vimentin network is completely absent.
On the other hand, ~30% of knockouts in
Drosophila and mice correspond to critical
nodes whose reduction in gene product, or total
elimination, causes the network to crash most
of the time, although even in some of these
cases, phenotypic normalcy ensues, given the
appropriate genetic background. Thus, there are
no "good" genes or "bad" genes, but only
networks that exist at various levels and at
different connectivities, and at different states of
sensitivity to perturbation. Sophisticated
mathematical analysis needs to be constantly
evaluated against hard biological data sets that
specifically address network dynamics. Nowhere is
this more critical than in attempts to come to
grips with "complexity," particularly because
deconvoluting and correcting complex
networks that have undergone perturbation, and
have resulted in human diseases, is the greatest
significant challenge now facing us.
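
The trade-off described above, tolerance to the loss of most nodes but vulnerability at a few highly connected ones, can be illustrated with a toy network. The sketch below is a deliberately small, hypothetical hub-and-leaf graph, not a model of any real gene network; the node names, sizes, and function names are assumptions chosen for illustration.

```python
from collections import deque


def build_hub_network(n_hubs=3, leaves_per_hub=10):
    """Build a chain of hubs, each hub linked to its own set of leaf nodes."""
    edges = {}

    def link(a, b):
        edges.setdefault(a, set()).add(b)
        edges.setdefault(b, set()).add(a)

    for h in range(n_hubs):
        hub = f"hub{h}"
        if h > 0:
            link(f"hub{h - 1}", hub)  # hubs are joined in a chain
        for leaf in range(leaves_per_hub):
            link(hub, f"leaf{h}_{leaf}")
    return edges


def largest_component(edges, removed):
    """Size of the largest connected component after deleting `removed` nodes."""
    seen = set(removed)  # deleted nodes are never visited
    best = 0
    for start in edges:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:  # breadth-first search over surviving nodes
            node = queue.popleft()
            size += 1
            for neighbor in edges[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        best = max(best, size)
    return best


if __name__ == "__main__":
    net = build_hub_network()
    print("intact network:", largest_component(net, set()))
    print("one leaf knocked out:", largest_component(net, {"leaf0_0"}))
    print("central hub knocked out:", largest_component(net, {"hub1"}))
```

In this toy graph of 33 nodes, deleting a leaf leaves a connected component of 32 nodes, whereas deleting the central hub fragments the network so that the largest surviving component has only 11 nodes.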
It has been predicted for the last 15 years
that complete sequencing of the human
genome would open up new strategies for
human biological research and would have a
major impact on medicine, and through
medicine and public health, on society. Effects on
biomedical research are already being felt.
This assembly of the human genome
sequence is but a first, hesitant step on a long
and exciting journey toward understanding
the role of the genome in human biology. It
has been possible only because of
innovations in instrumentation and software that
have allowed automation of almost every step
of the process from DNA preparation to
annotation. The next steps are clear: We must
define the complexity that ensues when this
relatively modest set of about 30,000 genes is
expressed. The sequence provides the
framework upon which all the genetics,
biochemistry, physiology, and ultimately phenotype
depend. It provides the boundaries for
scientific inquiry. The sequence is only the first
level of understanding of the genome. All
genes and their control elements must be
identified; their functions, in concert as well
as in isolation, defined; their sequence
variation worldwide described; and the relation
between genome variation and specific
phenotypic characteristics determined. Now we
know what we have to explain.

Another paramount challenge awaits:
public discussion of this information and its
potential for improvement of personal health.
Many diverse sources of data have shown
that any two individuals are more than 99.9%
identical in sequence, which means that all
the glorious differences among individuals in
our species that can be attributed to genes
fall in a mere 0.1% of the sequence. There
are two fallacies to be avoided: determinism,
the idea that all characteristics of the person
are "hard-wired" by the genome; and
reductionism, the view that with complete
knowledge of the human genome sequence, it is
only a matter of time before our
understanding of gene functions and interactions will
provide a complete causal description of
human variability. The real challenge of human
biology, beyond the task of finding out how
genes orchestrate the construction and
maintenance of the miraculous mechanism of our
bodies, will lie ahead as we seek to explain
how our minds have come to organize
thoughts sufficiently well to investigate our
own existence.
their ethnic backgrounds. Standard blood bank
screens (screening for HIV, hepatitis viruses, and so
laboratory prior to DNA extraction in the Celera
laboratory. All samples that tested positive for
transmissible viruses were ineligible and were
discarded. Karyotype analysis was performed on
peripheral blood lymphocytes from all samples
selected for sequencing; all were normal. A two-staged
consent process for prospective donors was
employed. The first stage of the consent process
provided information about the genome project,
procedures, and risks and benefits of participating. The
second stage of the consent process involved
answering follow-up questions and signing consent
forms, and was conducted about 48 hours after the
first.

33. DNA was isolated from blood (173) or sperm. For
sperm, a washed pellet (100 µl) was lysed in a
suspension (1 ml) containing 0.1 M NaCl, 10 mM
tris-Cl-20 mM EDTA (pH 8), 1% SDS, 1 mg
proteinase K, and 10 mM dithiothreitol for 1 hour at 37°C.
The lysate was extracted with aqueous phenol and
with phenol/chloroform. The DNA was ethanol
precipitated and dissolved in 1 ml TE buffer. To make
genomic libraries, DNA was randomly sheared,
end-polished with consecutive BAL31 nuclease and T4
DNA polymerase treatments, and size-selected by
electrophoresis on 1% low-melting-point agarose.
After ligation to Bst XI adapters (Invitrogen, catalog
no. N408-18), DNA was purified by three rounds of
gel electrophoresis to remove excess adapters, and
the fragments, now with 3'-CACA overhangs, were


vector with 

. ■:...^^.3 -TCTG;oyerhangs; -Libraries : with"" three 'different 
; - average sizes of inserts' were constructed: 2. 10 and 
^ • 'in a 
;r4 <■ :high-copy .pUCI 8 derivative." The ^ 10- and 50-kbp 
-^^--. -/ragments .were cloned In a, rhedium-copy pBR322 
derivative: The 2-:. and 11 0-kbp. libraries yielded lini- 
. .J-forrn^sized . large - colonies on '•plating.^However.%e 
.AC/-:.-:v5PTkbp,Ubrarjes .produced- man^i'smalbco'lonies'^nd i 
> : *^«rts . were.unstabte.:;T6..remedy^th'is,Hhe^6^^^ ' 
,V/^;Um)ranes>ere.-digeste^ 
, ^,y.>:,:,deave .the ;:vector.v but .jgenerally idea ved^ 
>-:v> -A times within the.:50-kbp;ins*ert-^ 
.V.^,vi'fcanamycin,v.resistance <cassette;^i(pui^ified\^frqm 
:v.y;- ..fpUCK4;Amersham Phamiacia;^ 

- 0,1) was added and ligationiwas carried'out at 37';c 
, r . .Jn the contiiiual presence of Bgl II. As^Bgl'll-Bgl I) 
Ugations. occurred,. they. -were^cbntinually-cleaved, 

- -whereas Barri HI-Bgl II. ligations were' not deaved.'A 
high yield pfJnternally. deleted, drcular .library mol^^ 

..• t K ecules ,was .'obtained - in which ?the , residual Insert - 
^1:^^^ ends,,were .separated..:by.(theikanamycin cassette 
.<.,;;:/PNA..The, internally deleted libraries/when plated " 
• ^^'V^.^^°"^'"i"fi ?^P!cUlin:(50'itg/ml). cartenl- 
:::-/,f;, cilbn (50 tig/mj);and kanamydri" (15 fig/ml),:iiro: 

v^ -duced relatively uniform large colonies. The result- 
^v-iingdones could be preparedjor sequendng'Uslng 
the same procedures as dones from the 10-kbp 

.-..libraries. • . , 

34. Transformed cells were plated on agar diffusion
plates prepared with a fresh top layer containing no
antibiotic poured on top of a previously set bottom
layer containing excess antibiotic, to achieve the
correct final concentration. This method of plating
permitted the cells to develop antibiotic resistance
before being exposed to antibiotic without the potential
clone bias that can be introduced through
liquid outgrowth protocols. After colonies had
grown, QBot (Genetix, UK) automated colony-picking
robots were used to pick colonies meeting stringent
size and shape criteria and to inoculate 384-well
microtiter plates containing liquid growth medium.
Liquid cultures were incubated overnight
with shaking, and were scored for growth before
passing to template preparation. Template DNA was
extracted from liquid bacterial culture using a procedure
based upon the alkaline lysis miniprep method
(113) adapted for high-throughput processing in
384-well microtiter plates. Bacterial cells were
lysed; cell debris was removed by centrifugation;
and plasmid DNA was recovered by isopropanol
precipitation and resuspended in 10 mM tris-HCl
buffer. Reagent dispensing operations were accomplished
using Titertek MAP 8 liquid dispensing systems.
Plate-to-plate liquid transfers were performed
using Tomtec Quadra 384 Model 320 pipetting robots.
All plates were tracked throughout processing
by unique plate barcodes. Mated sequencing reads
from opposite ends of each clone insert were obtained
by preparing two 384-well cycle sequencing
reaction plates from each plate of plasmid template
DNA using ABI PRISM BigDye Terminator chemistry
(Applied Biosystems) and standard M13 forward
and reverse primers. Sequencing reactions were prepared
using the Tomtec Quadra 384-320 pipetting
robot. Parent-child plate relationships and, by extension,
forward-reverse sequence mate pairs were
established by automated plate barcode reading by
the onboard barcode reader and were recorded by
direct LIMS communication. Sequencing reaction
products were purified by alcohol precipitation and
were dried, sealed, and stored at 4°C in the dark
until needed for sequencing, at which time the
reaction products were resuspended in deionized
formamide and sealed immediately to prevent degradation.
All sequence data were generated using a
single sequencing platform, the ABI PRISM 3700
DNA Analyzer. Sample sheets were created at load
time using a Java-based application that facilitates
barcode scanning of the sequencing plate barcode,
retrieves sample information from the central LIMS,
and reserves unique trace identifiers. The application
permitted a single sample sheet file in the
linking directory and deleted previously created
sample sheet files immediately upon scanning of a
sample plate barcode, thus enhancing sample sheet-to-plate
associations.
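
The parent-child plate bookkeeping described in note 34 can be illustrated with a small sketch. It assumes one forward and one reverse reaction plate per template plate and that a clone keeps its well position across plates. The record layout and the names `ReactionPlate`, `well_names`, and `mate_pairs` are hypothetical and are not taken from Celera's LIMS.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class ReactionPlate:
    barcode: str         # unique barcode of the sequencing reaction plate
    parent_barcode: str  # barcode of the plasmid template plate it was made from
    direction: str       # "forward" or "reverse" M13 primer


def well_names(rows="ABCDEFGHIJKLMNOP", cols=24):
    """Names of the 384 wells of a 16 x 24 microtiter plate (A1 .. P24)."""
    return [f"{r}{c}" for r, c in product(rows, range(1, cols + 1))]


def mate_pairs(plates):
    """Pair forward and reverse reads that share a template plate and a well."""
    by_parent = {}
    for plate in plates:
        by_parent.setdefault(plate.parent_barcode, {})[plate.direction] = plate
    pairs = []
    for _parent, children in sorted(by_parent.items()):
        # A template plate yields mate pairs only when both child plates exist.
        if "forward" in children and "reverse" in children:
            fwd, rev = children["forward"], children["reverse"]
            for well in well_names():
                pairs.append((f"{fwd.barcode}:{well}", f"{rev.barcode}:{well}"))
    return pairs
```

Keying the pairing on the parent barcode rather than on read names mirrors the idea in the note that mate relationships follow from plate relationships alone.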

35. [...] U.S.A. 74, 5463 (1977); J. M. Prober et al.,
Science 238, 336 (1987).

36. Celera's computing environment is based on Compaq
Computer Corporation's Alpha [...] technology
running the Tru64 Unix operating system. [...]
Alphas as data servers and as
nodes in a Virtual Compute Farm, all of which are
connected to a fully switched network operating at
Fast Ethernet speed (for the VCF) and gigabit Ethernet
speed (for data servers). Load balancing and
[...] requirements, and priority.
[...] running at 667
MHz. Available memory on these systems ranges
[...]. The VCF is used to manage trace
file processing and annotation. Genome assembly
was performed [...]
[...] 32 GB of memory. A total of
[...] terabytes of physical disk storage was included
in a Storage Area Network that was available to
systems across [...]
availability, file and database servers were configured
as 4-node Alpha TruClusters, so that services
would fail over in the event of hardware or software
[...] further enhanced by
[...]-based mirroring [...]

37. [...] quality values for base
calls by means of Paracel's TraceTuner, trims sequences
according to quality values, trims vector
and adapter sequence from high-quality reads,
and screens sequences for contaminants. Similar in
design and algorithm to the phred program (114),
TraceTuner reports quality values that reflect the
log-odds score of each base being correct. Read
quality was evaluated in 50-bp windows, each read
being trimmed to include only those consecutive
[...] minimum mean accuracy of
97%. End windows (both ends of the trace) of 1, 5,
[...] and 50 bases were trimmed to a minimum
mean accuracy of 98%. Every read was further
checked for vector and contaminant matches of 50
[...] was removed
from consideration. Finally, any match to the 5'
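
The window-based trimming described in this note can be sketched as follows. This is an illustration under stated assumptions, not TraceTuner's algorithm: quality values are treated as phred-style scores (accuracy = 1 - 10^(-Q/10)), windows are non-overlapping, and the read is trimmed to the longest run of consecutive passing windows. The end-window trimming and the vector and contaminant screens are not modeled, and the function names are hypothetical.

```python
def accuracy(q):
    """Probability that a base call is correct, assuming a phred-style quality value."""
    return 1.0 - 10.0 ** (-q / 10.0)


def trim_read(quals, window=50, min_mean_accuracy=0.97):
    """Return (start, end) of the longest run of consecutive passing windows.

    Windows are non-overlapping (an assumption); a trailing partial window
    is ignored. Returns (0, 0) when no window passes.
    """
    best = (0, 0)
    run_start = None
    n_windows = len(quals) // window
    # One extra iteration acts as a sentinel that closes a run at the read end.
    for w in range(n_windows + 1):
        passes = False
        if w < n_windows:
            chunk = quals[w * window:(w + 1) * window]
            mean_acc = sum(accuracy(q) for q in chunk) / window
            passes = mean_acc >= min_mean_accuracy
        if passes and run_start is None:
            run_start = w * window
        elif not passes and run_start is not None:
            end = w * window
            if end - run_start > best[1] - best[0]:
                best = (run_start, end)
            run_start = None
    return best
```

The returned coordinates are half-open, so `quals[start:end]` is the retained portion of the read.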

38. National Center for Biotechnology Information
(NCBI); available at www.ncbi.nlm.nih.gov/.
39. NCBI; available at www.ncbi.nlm.nih.gov/HTGS/. 

40. All bactigs over 3 kbp were examined for coverage
by Celera mate pairs. An interval of a bactig was
deemed an assembly error where there were no
mate pairs spanning the interval and at least two
reads that should have their mate on the other side
of the interval but did not. In other words, there was
[...]
breakpoint interval and at least two mate pairs
contradicting the join. By this criterion, we detected
and broke apart bactigs at 13,037 locations, or
[...]
Lander-Waterman statistic (115), the odds were
0.99 or more that the assembly we produced was
inconsistent with the sequence coming from a single
source. By this criterion, 714 or 2.2% of
entries were deemed chimeric.
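
The mate-pair criterion for bactig breakpoints in note 40 can be sketched as below. It is a minimal illustration: the `Read` record, the idea of a precomputed `expected_mate` position (derived from clone size and orientation), and the function names are assumptions made for the example and are not part of the published pipeline.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Read:
    pos: int              # position of this read on the bactig
    expected_mate: int    # where its mate should fall (hypothetical, precomputed)
    mate: Optional[int]   # where the mate was actually placed, or None if absent


def side(x, start, end):
    """-1 if x lies left of the interval, +1 if right, 0 if inside it."""
    return -1 if x < start else (1 if x > end else 0)


def is_assembly_error(reads, start, end, min_contradicting=2):
    """Apply the rule: no spanning mate pairs and at least two contradicting reads."""
    spanning = 0
    contradicting = 0
    for r in reads:
        s = side(r.pos, start, end)
        if s == 0:
            continue  # reads inside the interval are not informative here
        if r.mate is not None and side(r.mate, start, end) == -s:
            spanning += 1       # mate found on the far side: supports the join
        elif side(r.expected_mate, start, end) == -s:
            contradicting += 1  # mate should be on the far side but is not
    return spanning == 0 and contradicting >= min_contradicting
```

A spanning pair is counted once per listed read, which does not matter for the rule because only the absence of spanning pairs is tested.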

^A^!l63S?^^^"^^-"'''-'^-^'''"P«- 

/n ^1''"' C<""P«af<«'af Metfto* 

?o^S,,;7?^89 ' Naw 

44. Deloukas et al., Science 282, 744 (1998).

45. M. A. Marra et al., Genome Res. 7, 1072 (1997).

46. J. Zhang et al., data not shown.

47. Shredded bactigs were located on long CSA scaffolds
(>500 kbp) and the distribution of these
fragments on the scaffolds was analyzed. If the
[...] greater than four
[...] BAC was considered
to be chimeric. In addition, if >20% of
[...]
then the BAC [...]
[...] for CSA gave the minimal estimate of chimer[...]

48. M. Hattori et al., Nature [...]

49. I. Dunham et al., Nature 402, 489 (1999).

[...] Lipman, J. Mol. Biol. 215, 403 (1990).
55a. M. Olivier et al., Science 291, 1298 (2001).
55b. See http://genome.ucsc.edu/.

57. D. Dickson, Nature 401, 311 (1999).

- l^woj. - f\ V. -■. . . v^V^v^^^ 

60. M. Yandell, in preparation.

61. K. D. Pruitt, K. S. Katz, H. Sicotte, D. R. Maglott,
Trends Genet. 16, 44 (2000).

62. Scaffolds containing greater than 10 kbp of sequence
were analyzed for features of biological
importance through a series of computational steps,
and the results were stored in a relational database.
For scaffolds greater than one megabase, the sequence
was cut into single megabase pieces before
computational analysis. All sequence was masked
[...] gene finding or homology-based analysis. The computational
pipeline required ~7 hours of CPU time
per megabase, including repeat masking, or a total
compute time of about 20,000 CPU hours. Protein
searches were performed against the nonredundant
protein database available at the NCBI. Nucleotide
searches were performed against human, mouse,
and rat Gene Indices (assemblies of cDNA
and EST sequences), mouse genomic DNA reads
generated at Celera (3×), the Ensembl gene database
[...] Institute (EBI), human and rodent (mouse and rat) EST
data sets parsed from the dbEST database (NCBI),
[...] experimental
mRNA database (NCBI). Initial searches were performed
on repeat-masked sequence with BLAST 2.0
(54) optimized for the Compaq Alpha compute
[...] BLASTN searches and 1 × 10^[...] for BLASTX searches.
Additional processing of each query-subject pair
[...] alignments. All protein
BLAST results having an expectation score of
<1 × 10^[...], human nucleotide BLAST results having
an expectation score of <1 × 10^[...] with >94%
identity, and rodent nucleotide BLAST results having
an expectation score of <1 × 10^[...] with >80%
identity were then examined on the basis of their
high-scoring pair (HSP) coordinates on the scaffold
to remove redundant hits, retaining hits that supported
possible alternative splicing. For BLASTX
searches, analysis was performed separately for selected
model organisms (yeast, mouse, human, C.
elegans, and D. melanogaster) so as not to exclude
HSPs from these organisms that support the same
gene structure. Sequences producing BLAST hits
judged to be informative, nonredundant, and sufficiently
similar to the scaffold sequence were then
realigned to the genomic sequence with Sim4 for
ESTs, and with Lap for proteins. Because both of
these algorithms take splicing into account, the
resulting alignments usually give a better representation
of intron-exon boundaries than standard
BLAST analyses and thus facilitate further annotation
(both machine and human). In addition to the
homology-based analysis described above, three ab
initio gene prediction programs were used (63).
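
Two mechanical steps of note 62 lend themselves to a short sketch: cutting long scaffolds into megabase pieces, and relating the quoted rate of about 7 CPU hours per megabase to the total compute time. A generic hit filter is included as well. The expectation cutoffs are not legible in this copy, so `max_evalue` is left as a caller-supplied parameter with no default. All function names are hypothetical.

```python
MEGABASE = 1_000_000


def megabase_pieces(scaffold_length, piece=MEGABASE):
    """Yield half-open (start, end) coordinates of pieces at most `piece` bases long."""
    for start in range(0, scaffold_length, piece):
        yield start, min(start + piece, scaffold_length)


def cpu_hours(total_bases, hours_per_megabase=7.0):
    """Rough CPU time at the quoted rate of about 7 hours per megabase."""
    return total_bases / MEGABASE * hours_per_megabase


def keep_hit(evalue, identity, max_evalue, min_identity=0.0):
    """True if a BLAST hit passes an expectation cutoff and a percent-identity cutoff.

    `max_evalue` must be supplied by the caller; the published cutoffs
    are not reproduced here.
    """
    return evalue < max_evalue and identity > min_identity
```

At 7 hours per megabase, 20,000 CPU hours corresponds to roughly 2,860 megabases of analyzed sequence, which is the order of magnitude expected for a whole human genome assembly.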
63. E. C. Uberbacher, Y. Xu, R. J. Mural, Methods Enzymol.
266, 259 (1996); C. Burge, S. Karlin, J. Mol. [...]
90, 11995 (1993).

[...] Genome 11, 373 (2000).
86. Y. Pan, W. K. Decker, A. H. H. M. Hui, W. J. Craigen, [...]
87. P. Nouvel, Genetica 93, 191 (1994).

88. [...] Genome Res. 10, 672 [...]

89. Lek first compares all proteins in the proteome to
one another. Next, the resulting BLAST reports are
parsed, and a graph is created wherein each protein
constitutes a node; any hit between two proteins
with an expectation beneath a user-specified
threshold constitutes an edge. Lek then uses this
graph to compute a similarity between each protein
pair i, j in the context of the graph as a whole by
simply dividing the number of BLAST hits shared in
common between the two proteins by the total
number of proteins hit by i and j. This simple metric
has several interesting properties. First, because the
similarity metric takes into account both the similarity
and the differences between the two sequences
at the level of BLAST hits, the metric respects the
multidomain nature of protein space. Two multidomain
proteins, for instance, each containing domains
A and B, will have a greater pairwise similarity
to each other than either one will have to a protein
containing only A or B domains, so long as A-B-containing
multidomain proteins are less frequent in
the proteome than are single-domain proteins containing
A or B domains. A second interesting property
of this similarity metric is that it can be used to
produce a similarity matrix for the proteome as a
whole without having to first produce a multiple
alignment for each protein family, an error-prone
and very time-consuming process. Finally, the metric
does not require that either sequence have significant
homology to the other in order to have a
defined similarity to each other, only that they
share at least one significant BLAST hit in common.
This is an especially interesting property of the
metric because it allows the rapid recovery of protein
families from the proteome for which no multiple
alignment is possible, thus providing a computational
basis for the extension of protein homology
searches beyond those of current HMM- and profile-based
search methods. Once the whole-proteome
similarity matrix has been calculated, Lek first partitions
the proteome into single-linkage clusters
(27) on the basis of one or more shared BLAST hits
between two sequences. Next, these single-linkage
clusters are further partitioned into subclusters,
each member of which shares a user-specified pairwise
similarity with the other members of the cluster,
as described above. For the purposes of this
publication, we have focused on the analysis of
single-linkage clusters and what we have termed
complete clusters, e.g., those subclusters for
which every member has a similarity metric of 1 to
every other member of the subcluster. We believe
that the single-linkage and complete clusters are of
special interest, in part, because they allow us to
estimate and to compare sizes of core protein sets
in a rigorous manner. The rationale for this is as
follows: if one imagines for a moment a perfect
clustering algorithm capable of perfectly partitioning
one or more perfectly annotated protein sets
into protein families, it is reasonable to assume that
the number of clusters will always be greater than,
or equal to, the number of single-linkage clusters,
because single-linkage clustering is a maximally agglomerative
clustering method. Thus, if there exists
a single protein in the predicted protein set containing
domains A and B, then it will be clustered by
single linkage together with all single-domain proteins
containing domains A or B. Likewise, for a
predicted protein set containing a single multidomain
protein, the number of real clusters must
always be less than or equal to the number of
complete clusters, because it is impossible to place
a unique multidomain protein into a complete cluster.
Thus, the single-linkage and complete clusters
plus singletons should comprise a lower and upper
bound of sizes of core protein sets, respectively,
allowing us to compare the relative size and complexity
of different organisms' predicted protein set

[...] probability of observing a duplicated set of three
genes in two different locations, where the three
genes occur across a spread of five positions in both
locations, the expected number of such
matched sets in the predicted protein set is approximately
(N)36/N^2 = 36/N, a value <<1. Therefore,
any such duplications of three genes are unlikely to
result from random rearrangements of the genome. If
any of the genes occur in more than two copies, the
probability that the apparent [...] occurred by [...]

[...] and at least one from a multicellular eukaryote,
the cluster was extended. For the extension
step, a Hidden Markov Model (HMM) was trained for
the cluster, using the SAM software package, version 2.
The HMM was then scored against GenBank
NR (excluding mutants but including fragments for
this step), and all sequences scoring better than a
specific (NLL-NULL) score were added to the cluster.
The HMM was then retrained (with fixed model
[...]) and all sequences in the cluster were aligned
[...] a multiple sequence alignment [...]

94. B. J. Trask et al., [...] Sharon et al., Genomics [...]
95. W. B. Barbazuk et al., [...] A. McLysaght [...]
411 (1999).
96. Reviewed in L. Skrabanek, K. H. Wolfe, [...]
Genet. Dev. 8, 694 (1998).
97. P. Taillon-Miller, Z. Gu, Q. Li, L. Hillier, P.-Y. Kwok,
Genome Res. 8, 748 (1998); P. Taillon-Miller, E. E.
Piernot, P.-Y. Kwok, Genome Res. 9, 499 (1999).
98. D. Altshuler et al., [...]
99. G. T. Marth et al., Nature Genet. 23, 452 (1999).
100. W.-H. Li, [...]
MA, 1997).
101. M. Cargill et al., Nature Genet. 22, 231 (1999).
102. M. K. Halushka et al., Nature Genet. 22, 239 (1999).
103. J. Zhang, T. L. Madden, Genome Res. 7, 649 (1997).
104. M. Nei, Molecular Evolutionary Genetics (Columbia
Univ. Press, New York, 1987).
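
The Lek procedure of note 89 (shared-hit similarity, single-linkage clusters, and complete clusters) can be sketched in a few lines. This is an illustration of the description only, not the Lek program itself. It assumes that every protein counts as hitting itself, that hits are symmetric, and that hits have already been filtered by the user-specified expectation threshold. All function names are hypothetical.

```python
from itertools import combinations


def hit_sets(edges, proteins):
    """Map each protein to the set of proteins it hits (self-hits included, an assumption)."""
    hits = {p: {p} for p in proteins}
    for a, b in edges:  # edges: BLAST hits that passed the expectation threshold
        hits[a].add(b)
        hits[b].add(a)
    return hits


def similarity(hits, i, j):
    """Hits shared by i and j divided by all proteins hit by i or j."""
    return len(hits[i] & hits[j]) / len(hits[i] | hits[j])


def single_linkage(hits):
    """Group proteins that share at least one BLAST hit, transitively (union-find)."""
    parent = {p: p for p in hits}

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]  # path halving
            p = parent[p]
        return p

    for p, targets in hits.items():
        for t in targets:
            parent[find(p)] = find(t)
    clusters = {}
    for p in hits:
        clusters.setdefault(find(p), set()).add(p)
    return sorted((sorted(c) for c in clusters.values()), key=lambda c: c[0])


def is_complete(hits, members):
    """True if every pair of members has a similarity of exactly 1."""
    return all(similarity(hits, i, j) == 1.0 for i, j in combinations(members, 2))
```

With two multidomain proteins that each hit a single-domain A protein and a single-domain B protein, the multidomain pair scores 1 against each other and less than 1 against either single-domain protein. This matches the multidomain property described in the note, while single linkage still merges all four into one cluster.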